On Solving Hard Problems

Over the past few months, my primary role at work has been that of a bringup engineer: as a new part arrives, from the time the first silicon gets back until we hit production release, I'm part of the team that shepherds the new chip through everything that needs to happen before we can ship it. This workload is somewhat bursty; some weeks, there will be little to do, but other weeks, I'll spend many, many hours fighting fires after we discover that parts of the system have critical flaws in them. Some of these experiences have been qualitatively different from things I've done before, while others have served to reinforce the lessons I learned in the past about how to work through, and ultimately solve, hard problems.

In this article, I've tried to capture and distill some of the lessons to take to heart. I've broken it up into a section containing the two overarching lessons that I found most useful to focus on (and wish that I had followed more in the past!), and then a section containing the remaining bits of strategy that I often find useful.

Phrases of note

The two lessons I found most important came in the form of phrases: one to pay rapt attention to, and one to pay rapt disrespect to. You may well find that these pop up in tougher problems that you might have worked through in the past, perhaps as steps towards the solution; after formulating them, I've found them to be somewhat unifying for my experiences. Hopefully, then, by being aware of these key phrases, difficult problems can be more easily reduced to tractable ones.

"That's funny."

Perhaps most apt as an introduction is Lester Freamon's quote from the first season of The Wire:

"We're building something here, Detective. We're building it from scratch. All the pieces matter."

Freamon, of course, was building a case against a regional drug and crime syndicate, not debugging. At the time he said it, though, there was something of a similarity between the tasks at hand; Detective Pryzbylewski was about to take an important piece of information, and mark it Non-Pertinent -- irrelevant. Freamon, in this scene, has picked subtle pieces of information out of the phone call that they were listening to that can make their case for them.

The same thing applies when working to root cause a bug. The title phrase that I allude to -- "that's funny" -- can be one of the most important. When you hear yourself or someone else remarking on something out of the ordinary, don't just put it aside -- find out what's going on! At the very least, write it down, so that it doesn't get lost; perhaps you'll find later that it ties in to confirm a hypothesis.

Three examples of this sort of insight (two from personal experience) come to mind:

In the golden years of America's space program, Apollo 12 took a lightning strike during launch². Immediately afterward, many of the command module's systems began to misbehave in a confusing pattern that indicated wide systemic failure -- it appeared to be a product of multiple unrelated components all failing as a result of the strike. The mission was nearly deemed unrecoverable, and if no action was taken, would have become the Apollo program's only abort before entering orbit. However, one engineer called out to switch a power supply rail to auxiliary power; immediately, all of the apparently failed systems began reporting correct telemetry once again.

This engineer solved the problem because he had traced down a similar scenario in a ground test many missions before. He saw a similar pattern of failures that seemed to be unrelated; in that case, however, a different test was being performed, and the navigation system failures were ignored. Later that week, he traced the issue down to a disconnected auxiliary power supply on that rail in simulation; when he flipped the switch back to "main" on the test module, the problem disappeared. In this case, the protocol of actively seeking to root cause what they called "funnies" saved a mission just a year or two later.

At one employer, I was called in to look at an issue in which a system worked OK with one version of a chip, but did not work with another version of what should have been the same chip. As far as we knew, the new version had a change in one of the pads, and nothing else. We knew that the system sometimes didn't boot at all, but what we had forgotten about was that sometimes, while the system was booting, driver initialization would take a long time, producing an error before continuing on. At the time, we chalked that up to general instability in preproduction drivers, but we had never seen that message before.

We ultimately root caused the issue to an error in which we had also changed another similar pad. After we had independently diagnosed the issue, we realized that the message pointed directly to the specific inadvertent pad change, and an error condition that such a thing could trigger. Had we looked at the details of that message earlier, we would have been spared some hours of frustrating and stressful debugging.

In college, I took a class that was the closest thing our computer science department had to a capstone class: Operating System Design and Implementation. (It ultimately proved to be a formative experience for me, and I went on to TA it for quite some time.) One of the projects in this course is to write a kernel. During this project, I came across a peculiar issue: under particular types of loads, the virtual machine's display would fill with garbage, and the machine would crash. The crash was a "triple fault" -- something went very wrong with the system. By that point, the virtual machine's state was corrupted and no registers were available, since the triple fault caused the virtual CPU to reset.

In retrospect, I wish I had taken note of the particular contents of the display. What I had previously believed to be garbage, surely, was not; it contained important information about what the system was doing at the time. The exact values of the garbage were meaningful! At the time, I wrote it off as simply a strange part of the crash syndrome, but there was much more to it than that. Paying attention to weird data would have saved a day or two of debugging.
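
As a rough illustration of what "reading the garbage" could look like, the sketch below reinterprets a captured copy of screen memory as little-endian 32-bit words and flags any that fall inside a kernel address range. Everything specific here is an assumption made for the example: that the display is a text-mode buffer that stray writes landed in, the address range in `KERNEL_LO`/`KERNEL_HI`, and the helper names. It is not a reconstruction of the bug described above.

```python
import struct

# Hypothetical address range in which kernel code and data live; adjust it
# to match the memory map of the kernel under test.
KERNEL_LO = 0x00100000
KERNEL_HI = 0x01000000


def screen_words(screen: bytes) -> list[int]:
    """Reinterpret raw screen memory as little-endian 32-bit words."""
    usable = len(screen) - (len(screen) % 4)  # drop a trailing partial word
    return [word for (word,) in struct.iter_unpack("<I", screen[:usable])]


def suspicious_words(screen: bytes, lo: int = KERNEL_LO, hi: int = KERNEL_HI):
    """Return (byte_offset, word) pairs whose value looks like a kernel address."""
    return [
        (index * 4, word)
        for index, word in enumerate(screen_words(screen))
        if lo <= word < hi
    ]


if __name__ == "__main__":
    # Two pointer-like values around one word of ordinary blank text cells.
    capture = struct.pack("<III", 0x00105A3C, 0x07200720, 0x00100FF0)
    for offset, word in suspicious_words(capture):
        print(f"offset {offset:#06x}: {word:#010x}")
```

Words that land in the expected range are only candidates; the point is to turn "garbage" into a short list of values worth checking against the kernel's symbol table.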

"That's impossible."

In contrast to the previous phrase -- something that one should stick one's head out to look for -- the phrase "that's impossible" sometimes must be ignored in a difficult debug session. The job of the debugging engineer is to stop at nothing in order to find the root cause of a bug. The request to gather extensive data in a post-mortem session can be a good cause for one to say "that's impossible". However, even without access to the system's normal facilities for introspection, one might still have to gather data to find a next step to proceed.

I have two cases showing the need to combat "that's impossible", both from the same bug.

In this bug, a graphics processor appeared to be disappearing from the system's bus -- the CPU couldn't reach it while the program was executing. This usually took some 16 hours to reproduce, so as much data as possible needed to be extracted from each repro session. Unfortunately, at that point we still had no indicators that the issue was about to recur, so any data collection had to be done post-mortem. The first time we found a particularly interesting syndrome after reproducing the issue, we wanted more details about what had happened, but the system did not have a debug build of the driver.

The standard assertion is that it's "impossible" to extract meaningful data from a driver's release build. Unfortunately, in this case, impossible simply wouldn't do; in order to make meaningful progress on this bug, we needed the data from this specific repro case. So, we started from basics -- we knew what was in the registers, and we knew that the data that we wanted was somewhere in memory, so we dug through data structures until we found an offset to the GPU's physical memory, and reconstructed our mental model of the machine's state from there.

We also had a piece of vendor code that we suspected of doing something specific to destabilize the system. We mentioned this to the vendor, who would not provide us with source (this binary contained other proprietary information). Again, one might assert that it is "impossible" to extract substantial knowledge about what a program does from only a binary; some time with objdump and grep made quick work of it.³
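
A small taste of how approachable this kind of binary spelunking can be: the sketch below pulls printable ASCII strings out of a blob of bytes and filters them by a keyword, roughly the "grep" half of that workflow. It is a pure-Python stand-in written for this example, not the commands that were actually run, and the function names and sample bytes are hypothetical.

```python
import re


def printable_strings(blob: bytes, min_len: int = 4) -> list[tuple[int, str]]:
    """Return (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(blob)]


def grep_strings(blob: bytes, needle: str, min_len: int = 4) -> list[tuple[int, str]]:
    """Keep only the extracted strings that contain `needle`."""
    return [(off, text) for off, text in printable_strings(blob, min_len) if needle in text]


if __name__ == "__main__":
    # Made-up bytes standing in for a section of a vendor binary.
    sample = b"\x00\x01pci_write_config\x00\xff\x7fab\x00reset_bus\x00"
    for offset, text in grep_strings(sample, "pci"):
        print(f"{offset:#x}: {text}")
```

Symbol names, format strings, and error messages found this way often point at the code paths worth disassembling first.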

In both of these cases, the word "impossible" was actually used to mean "infeasible". But as a problem gets more difficult, the requirements to judge something infeasible should go way up; extraordinary techniques and deeper dives into the innards of a system may be necessary to solve a problem.
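
The first case, walking data structures in a raw memory image without debug symbols, can be sketched in a few lines. The layouts below (`DEVICE_FMT`, `MEMDESC_FMT`), the addresses, and the function names are all invented for illustration; in a real session they would come from headers, disassembly, or patient guesswork, and the real driver's structures are not described here.

```python
import struct

# Hypothetical structure layouts (little-endian, no padding):
#   device:  u32 device id, u64 pointer to a memory descriptor
#   memdesc: u64 physical base address, u64 size in bytes
DEVICE_FMT = "<IQ"
MEMDESC_FMT = "<QQ"


def read_struct(dump: bytes, dump_base: int, addr: int, fmt: str) -> tuple:
    """Unpack `fmt` from the dump at virtual address `addr`.

    `dump_base` is the virtual address that byte 0 of the dump corresponds to.
    """
    offset = addr - dump_base
    size = struct.calcsize(fmt)
    if offset < 0 or offset + size > len(dump):
        raise ValueError(f"address {addr:#x} is outside the captured dump")
    return struct.unpack_from(fmt, dump, offset)


def find_gpu_phys_base(dump: bytes, dump_base: int, device_addr: int) -> tuple[int, int]:
    """Follow device -> memory descriptor and return (physical base, size)."""
    _device_id, memdesc_ptr = read_struct(dump, dump_base, device_addr, DEVICE_FMT)
    phys_base, size = read_struct(dump, dump_base, memdesc_ptr, MEMDESC_FMT)
    return phys_base, size


if __name__ == "__main__":
    # Build a tiny fake dump: a device struct at 0x1000 pointing at 0x1020.
    dump = bytearray(64)
    struct.pack_into(DEVICE_FMT, dump, 0x00, 0xABCD, 0x1020)
    struct.pack_into(MEMDESC_FMT, dump, 0x20, 0xE0000000, 0x01000000)
    base, size = find_gpu_phys_base(bytes(dump), 0x1000, 0x1000)
    print(f"GPU memory at {base:#x}, {size:#x} bytes")
```

The starting address would typically be a value recovered from a register, and each pointer followed is a hypothesis to be checked against what the next structure ought to contain.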

Infeasibility is not the only usage of "that's impossible", though. The other family of "that's impossible" that one should note -- and question! -- is when an observed symptom doesn't match with one's mental model as to what could happen. This aligns closely with "that's funny": rather than rejecting a data point that seems like it could not happen, we should instead focus in on it as much as we can. Drilling down to the details of such misbehaviors increases the depth of our knowledge of a system.

Tricks and techniques

The above two phrases are the overarching things that I really wish I had paid more attention to in the past. However, there are also a few other tricks that I feel are worth mentioning.

Keep a notebook.

If you expect to be running lots of experiments to root cause something, keep a notebook. In fact, even if you don't expect to be running lots of experiments, keep a notebook! A notebook can help you keep track of where you've been, which is important in two ways -- it can help you avoid repeating yourself, and it can help you put pieces together when you get stuck.

Keeping a notebook also has the added benefit of forcing you to justify experiments and paths of thought -- when you're about to look into something, you get in the habit of writing down why you're doing it. In turn, this means that you'll perform fewer 'useless' experiments; if you're not sure what to do next, instead of doing something unproductive to keep yourself occupied, you might instead end up looking at your old data to come up with a better course of action.

Collect syndromes.

If there's variance in how your problem manifests, sometimes it might be useful to let it reproduce a handful of times, and keep a collection of all the syndromes that you see. Are some syndromes more helpful to debug than others? Why is the variance there? What causes the variance? Sometimes, you'll see a syndrome that looks like a smoking gun -- in the case of a crash, the crash might occur while the incipient "bad thing" is still happening elsewhere on the system. The more differing data you have, the better you can form and eliminate hypotheses.
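
Even a very simple tally can make the variance visible. The sketch below assumes each repro run has already been boiled down by hand to a short syndrome label; the labels, run numbers, and function names are made up for the example.

```python
from collections import Counter


def tally_syndromes(runs):
    """Count how often each syndrome label appears across repro runs.

    `runs` is an iterable of (run_id, syndrome_label) pairs. Returns
    (label, count) pairs, most common first.
    """
    return Counter(label for _run_id, label in runs).most_common()


def rare_syndromes(runs, max_count=1):
    """Labels seen at most `max_count` times: the oddballs worth a closer look."""
    return [label for label, count in tally_syndromes(runs) if count <= max_count]


if __name__ == "__main__":
    runs = [
        (1, "bus timeout"),
        (2, "bus timeout"),
        (3, "page fault in ISR"),
        (4, "bus timeout"),
    ]
    for label, count in tally_syndromes(runs):
        print(f"{count:3d}  {label}")
    print("rare:", rare_syndromes(runs))
```

The rare entries are often the ones caught mid-failure, so they are good candidates for the "smoking gun" syndrome.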

Tell a story.

A useful abstraction for debugging can be storytelling. In any hard problem, there are two stories to be told: what you believe should be happening, and what is happening. Each can be expressed as a series of steps; at some point, the steps diverge. The problem at hand, then, is to narrow down the divergent step; zero in on it, and pinpoint exactly what went wrong. This is why explaining issues to a coworker or friend -- even a nontechnical friend!¹ -- is often instrumental in debugging, since it causes you to enumerate everything from the beginning. In a case of being truly stuck, you can even write the story down longhand in your notebook, forcing yourself to enumerate every detail.
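
The two-stories idea translates almost directly into code. The sketch below assumes both stories have been written out as ordered lists of steps and reports the first place they disagree; the step names and the function name are hypothetical.

```python
from itertools import zip_longest


def first_divergence(expected, observed):
    """Return (index, expected_step, observed_step) at the first mismatch.

    A story that ends early shows up as None on its side. Returns None
    when the two stories agree from start to finish.
    """
    for index, (want, got) in enumerate(zip_longest(expected, observed)):
        if want != got:
            return index, want, got
    return None


if __name__ == "__main__":
    expected = ["power on", "load firmware", "init driver", "enumerate bus"]
    observed = ["power on", "load firmware", "init driver timeout"]
    print(first_divergence(expected, observed))
```

In practice the hard part is writing the "observed" list honestly, from evidence such as logs and traces, and not from what you assume happened.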

Pick a plan.

Sometimes, a problem will come to a head, and it seems like the only hypotheses remaining are truly outlandish. The feeling remains: this can't be it! Unfortunately, we might have nothing better to work with. Luckily, plans of action in debugging that might be destructive are rare, so even if the root cause being traced seems like it can't exist, it can often make sense to go after it.

When you think you have an idea of what to do next, write it down, or say out loud, "My next step is X, and from that, I expect to have conclusive evidence either for Y or for Z." Such plans can help you interpret results, and can help you decide what you need to do next.

Work the problem.

When all else fails, in the immortal words of Gene Kranz, work the problem. Work from the ground up. Work from the top down. Work from the inside out, or the outside in. Work the problem. There are many ways of attacking a tough issue; try all of them. Above all, don't get buffaloed -- if it seems like it's going to take some doing to solve the problem, get doing! Gene Kranz said those words to his team when he had three astronauts stranded in a crippled spacecraft on the way to the Moon and no clue of what had gone wrong. His team did it: they worked through one of the toughest problems manned spaceflight has experienced, and came out victorious. Use everything at your disposal, and know that you're working towards one goal: a solution.

Concluding thoughts

Hard problems are hard because, well, they're not easy. This might seem obvious, but they require a qualitatively different skillset from the sorts of things that we use every day to deal with minor difficulties. Hard problems require perseverance. Hard problems require throwing every trick in the book at them. Hard problems require an intimate knowledge of the system in which they appear... and sometimes systems outside those, too. Hard problems require the people working with them to be at their best. In this article, I've presented a handful of pieces of advice that can help to develop a methodology for working problems like the toughest I've encountered.

I solicit your feedback -- was this helpful? Could I have been more clear in some places? Do you have other poignant examples? Let me know, or feel free to leave comments below.

joshua

Acknowledgements

I would like to thank Ben Wolf for his expert proofreading of this article. (He provided a handful of excellent stylistic suggestions, as well as a handful that I hardheadedly refused to take, so the remaining dysfluencies should be seen as my fault alone.) I would also like to thank Luke Chang, my manager during the time I wrote this article, for his understanding in helping me walk the fine line between providing detail and revealing our trade secrets.

¹ Indeed, maybe an inanimate object; beware, however, that there can be fierce debate as to whether a cardboard dog or a rubber duck is ideal for such debugging practices.

² For more stories like this, the interested reader might find Gene Kranz's memoir, Failure Is Not an Option, to be fascinating additional reading. His book served as inspiration in part for this article.

³ Truly, extracting information about a program from a binary is not only possible, but it's regularly done. Reverse engineering is a field that employs many, and there is a lot of information about it out there! For folks not in that business, though, it can seem somewhat foreign as a concept for learning more about friendly binaries, instead of malware...