Over the past few months, my primary role at work has been that of a bringup engineer: as a new part arrives, from the time the first silicon gets back until we hit production release, I'm part of the team that shepherds the new chip through everything that needs to happen before we can ship it. This workload is somewhat bursty; some weeks there is little to do, but other weeks I'll spend long hours fighting fires after discovering that parts of the system have critical flaws in them. Some of these experiences have been qualitatively different from things I've done before, while others have reinforced the lessons I learned in the past about how to work through, and ultimately solve, hard problems.
In this article, I've tried to capture and distill some of the lessons to take to heart. I've broken it up into a section containing the two overarching lessons that I found most useful to focus on (and wish that I had followed more closely in the past!), and then a section containing the remaining bits of strategy that I otherwise often find useful.
The two lessons I found most important came in the form of phrases: one to pay rapt attention to, and one to pay rapt disrespect to. You may well find that these popped up in tougher problems you've worked through in the past, perhaps as steps toward the solution; since formulating them, I've found that they unify many of my experiences. Hopefully, then, by being aware of these key phrases, difficult problems can more easily be reduced to tractable ones.
Perhaps most apt as an introduction is Lester Freamon's quote from the first season of The Wire:
“We're building something here, Detective. We're building it from scratch. All the pieces matter.”
Freamon, of course, was building a case against a regional drug and crime syndicate, not debugging. At the time he said it, though, there was something of a similarity between the tasks at hand: Detective Pryzbylewski was about to take an important piece of information and mark it Non-Pertinent – irrelevant. Freamon, in that scene, had picked subtle pieces of information out of the phone call they were listening to – details that could make their case for them.
The same thing applies when working to root cause a bug. The title phrase that I allude to – “that's funny” – can be one of the most important. When you hear yourself or someone else remarking on something out of the ordinary, don't just put it aside – find out what's going on! At the very least, write it down, so that it doesn't get lost; perhaps you'll find later that it ties in to confirm a hypothesis.
Three examples of this sort of insight (two from personal experience) come to mind:
In contrast to the previous phrase – something one should keep an ear out for – the phrase “that's impossible” must sometimes be ignored in a difficult debug session. The debugging engineer's job is to stop at nothing to find the root cause of a bug. A request to gather extensive data in a post-mortem session can easily prompt someone to say “that's impossible”. But even without access to the system's normal facilities for introspection, one may still have to gather that data to find the next step forward.
I have two cases showing the need to combat “that's impossible”, both from the same bug.
grep made quick work of it.
In both of these cases, the word “impossible” was actually used to mean “infeasible”. But as a problem gets harder, the bar for judging something infeasible should rise accordingly; extraordinary techniques and deeper dives into the innards of a system may be necessary to solve it.
Infeasibility is not the only usage of “that's impossible”, though. The other family of “that's impossible” that one should note – and question! – is when an observed symptom doesn't match with one's mental model as to what could happen. This aligns closely with “that's funny”: rather than rejecting a data point that seems like it could not happen, we should instead focus in on it as much as we can. Drilling down to the details of such misbehaviors increases the depth of our knowledge of a system.
The above two phrases are the overarching things that I really wish I had paid more attention to in the past. However, there are also a few other tricks that I feel are worth mentioning.
If you expect to be running lots of experiments to root cause something, keep a notebook. In fact, even if you don't expect to be running lots of experiments, keep a notebook! A notebook can help you keep track of where you've been, which is important in two ways – it can help you avoid repeating yourself, and it can help you put pieces together when you get stuck.
Keeping a notebook also has the added benefit of forcing you to justify experiments and paths of thought – when you're about to look into something, you get in the habit of writing down why you're doing it. In turn, you'll perform fewer 'useless' experiments; when you're not sure what to do next, instead of doing something unproductive just to keep busy, you might instead end up reviewing your old data to come up with a better course of action.
If there's variance in how your problem manifests, it can be useful to let it reproduce a handful of times and keep a collection of all the syndromes that you see. Are some syndromes more helpful to debug than others? Why is the variance there, and what causes it? Sometimes you'll see a syndrome that looks like a smoking gun – in the case of a crash, the crash might occur while the incipient “bad thing” is still happening elsewhere on the system. The more varied data you have, the better you can form and eliminate hypotheses.
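Collecting syndromes like this can even be mechanized. As a minimal sketch – the reproducer command and the bucketing heuristic (first line of stderr) are hypothetical stand-ins for whatever signature your system actually produces, such as a panic string or crash address:

```python
import collections
import subprocess

def collect_syndromes(repro_cmd, runs=20):
    """Rerun a flaky reproducer and bucket failures by signature.

    The signature heuristic here (first line of stderr) is a
    placeholder; a real harness might key on a crash address,
    a panic message, or a register dump instead.
    """
    buckets = collections.Counter()
    for _ in range(runs):
        result = subprocess.run(repro_cmd, capture_output=True, text=True)
        if result.returncode == 0:
            buckets["no failure"] += 1
        elif result.stderr:
            buckets[result.stderr.splitlines()[0]] += 1
        else:
            buckets["(failed with no output)"] += 1
    return buckets
```

A histogram of syndromes makes the variance itself visible: a single dominant bucket suggests one bug with noisy symptoms, while several distinct buckets may hint at multiple bugs, or at a race whose timing decides which victim falls over first.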
A useful abstraction for debugging can be storytelling. In any hard problem, there are two stories to be told: what you believe should be happening, and what is actually happening. Each can be expressed as a series of steps; at some point, the steps diverge. The problem at hand, then, is to narrow down the divergent step; zero in on it, and pinpoint exactly what went wrong. This is why explaining issues to a coworker or friend – even a nontechnical friend! – is often instrumental in debugging, since it forces you to enumerate everything from the beginning. When you're truly stuck, you can even write the story down longhand in your notebook, forcing yourself to enumerate every detail.
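When both stories exist in machine-readable form – an expected trace and an observed one – finding the divergent step is mechanical. A minimal sketch, assuming the two stories have been reduced to comparable lists of steps (the step values here are purely illustrative):

```python
import itertools

def first_divergence(expected, observed):
    """Return (index, expected_step, observed_step) at the first point
    where two traces diverge, or None if they match.

    A missing step appears as None, so a trace that ends early (or runs
    long) is also reported as a divergence.
    """
    pairs = itertools.zip_longest(expected, observed)
    for i, (want, got) in enumerate(pairs):
        if want != got:
            return i, want, got
    return None
```

Once the divergent step is known, the search space collapses: everything before it behaved as believed, so attention goes to whatever happens between the last matching step and the first mismatched one.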
Sometimes, a problem will come to a head, and it will seem like the only hypotheses remaining are truly outlandish. The feeling persists: this can't be it! Unfortunately, there may be nothing better to work with. Luckily, destructive plans of action are rare in debugging, so even if the root cause being traced seems like it can't exist, it often makes sense to go after it anyway.
When you think you have an idea of what to do next, write it down, or say it out loud: “My next step is this, and from it, I expect to have conclusive evidence for one hypothesis or the other.” Such plans help you interpret results, and help you decide what to do after that.
When all else fails, in the immortal words of Gene Kranz, work the problem. Work from the ground up. Work from the top down. Work from the inside out, or the outside in. Work the problem. There are many ways of attacking a tough issue; try all of them. Above all, don't get buffaloed – if it seems like it's going to take some doing to solve the problem, get doing! Gene Kranz said those words to his team when he had three astronauts stranded in space and no clue of what had gone wrong. His team did it: they worked through one of the toughest problems manned spaceflight has experienced, and came out victorious. Use everything at your disposal, and know that you're working towards one goal: a solution.
Hard problems are hard because, well, they're not easy. This might seem obvious, but it bears stating: they require a qualitatively different skillset from the one we use every day to deal with minor difficulties. Hard problems require perseverance. Hard problems require throwing every trick in the book at them. Hard problems require an intimate knowledge of the system in which they appear… and sometimes of systems outside it, too. Hard problems require the people working on them to be at their best. In this article, I've presented a handful of pieces of advice that can help you develop a methodology for working problems like the toughest I've encountered.
I solicit your feedback – was this helpful? Could I have been more clear in some places? Do you have other poignant examples? Let me know, or feel free to leave comments below.
I would like to thank Ben Wolf for his expert proofreading of this article. (He provided a handful of excellent stylistic suggestions, as well as a handful that I hardheadedly refused to take, so the remaining dysfluencies should be seen as my fault alone.) I would also like to thank Luke Chang, my manager during the time I wrote this article, for his understanding in helping me walk the fine line between providing detail and revealing our trade secrets.