Forensic Debugging

Mark Kampe

1. Introduction

Usually debugging is an interactive process. If you observe anomalous behavior in a program, you might simply try to reproduce the problem and observe the situations under which it does and does not behave correctly. Often, if you are familiar with the code, this alone is enough to identify the cause of the problem. If the precipitating circumstances don't provide enough information to reveal the problem, you might try running your program under a system call tracing framework to see what it was actually doing:

Sample strace output:
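A trace might look something like this (a hypothetical excerpt; the program, file names, and values are purely illustrative):

```
open("/etc/app.conf", O_RDONLY)          = 3
read(3, "loglevel=2\nretries=5\n", 4096) = 21
close(3)                                 = 0
open("/var/run/app.lock", O_RDONLY)      = -1 ENOENT (No such file or directory)
write(2, "cannot open lock file\n", 22)  = 22
exit_group(1)                            = ?
```

Even without source code, such a trace shows which files the program touched, which system call failed, and what it did in response.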

If this is still not sufficient to reveal the problem, you could try enabling (or adding) diagnostic instrumentation, or running your program under a debugger so that you can observe its progress and changing state. Modern tracing and source-level debugging tools make it very easy to quickly zero in on the causes of most failures, and have greatly improved the productivity of developers. Unfortunately, interactive debugging is not always possible:

The fact that a problem happens infrequently does not mean that it is not important. How would you feel about a bug in the collision avoidance software on a commercial jet that, under an extremely unlikely set of circumstances, would veer towards the hazard rather than away from it? There are many products for which no undiagnosed failures are acceptable. If a system fails, we must determine why and fix it. If we cannot reproduce and interact with the failure, then we will have to diagnose the failure on the basis of the data collected at the time.

Sir Arthur Conan Doyle (through Sherlock Holmes) first popularized the notion that, through the process of logical deduction, an insightful investigation of physical evidence could enable us to make valid inferences about past events. Forensic investigation is a collection of scientific techniques for establishing (after an incident) facts to be presented in a court of law (from the Latin forensis, meaning a public forum). The term is now applied, more generally, to any after-the-fact, scientific investigation of "what happened". We use the term forensic debugging to describe processes for diagnosing irreproducible software failures on the basis of data collected at the time of the failure.

In bygone ages, before good debuggers and cheap computer time, this was the way that most debugging had to be done. No one would want to go back to those dark ages, but it is occasionally unfortunate that modern programmers seldom have the opportunity to develop such skills ... because we still have occasional need for them.

2. Sources of Information

I have often likened the plight of a forensic debugger to the (everyday) situation of a cosmologist. Cosmology is not an experimental science. The processes cosmologists study do not happen very often (perhaps only once) and are (in general) not amenable to recreation in the laboratory. All of the data they will ever have has already been sent to them. The good news (and the bad news too) is that there is quite a lot of data to sort through. Their task is a cycle of observation, hypothesis, prediction, and confirmation.

Our situation is actually much more reasonable than theirs:

If cosmologists think they can figure out where the universe came from, we can surely figure out why our program crashed. What are the sources of information from which we can draw our inferences and seek confirmation?

2.1 Core Dumps

In many cases the primary (and most valuable) information available to us is in the form of core dumps. Perhaps a key program died (e.g. with a segmentation violation), or perhaps the core dumps were forced (manually or automatically) to capture the state of a system that didn't seem to be working properly.

In some cases, the core dump immediately explains the entire problem, because we can see (e.g. on the stack) the entire sequence of events that led to the failure. In other cases, the core dump is merely the corpse of an unfortunate victim (not unlike the victim of a freeway sniper). The most we can hope to learn from the corpse is the caliber of the weapon that was used and the direction from which it was fired.

2.1.1 Stack Traces

One of the most interesting things we can get is a stack trace, a list of all of the subroutine calls (and, if we are lucky, their parameters) that were on the stack at the time of death. For interpreted languages the stack trace may be produced directly by the interpreter.

Example: Java stack trace
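An uncaught exception in a (hypothetical) Java program might produce a trace like this; all class, method, and file names are illustrative:

```
Exception in thread "main" java.lang.NullPointerException
        at Inventory.lookup(Inventory.java:41)
        at OrderHandler.process(OrderHandler.java:17)
        at Main.main(Main.java:8)
```

Reading from the top, the trace names the routine that died, then each caller in turn, with file names and line numbers.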

Example: Python stack trace
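A sketch of how such a trace arises in Python (the function names are hypothetical, and we catch the exception only so that we can capture and print the trace ourselves rather than let the interpreter abort):

```python
import traceback

def lookup(table, key):
    return table[key]              # blows up when table is None (a "bad pointer")

def handle_request(request):
    return lookup(None, request)   # bug: passes None instead of a real table

try:
    handle_request("user-42")
except TypeError:
    trace = traceback.format_exc()

# The trace shows the whole call chain: handle_request -> lookup,
# with file names and line numbers for each frame.
print(trace)
```

The printed traceback immediately reveals not only where the program died, but the sequence of calls (and, by inspection of the arguments at each level, where the bogus value came from).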

For compiled languages it is usually necessary to run a debugger to interpret the contents of a core dump (with the help of the program's symbol table).

Example: C stack trace with parameters and line numbers
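With full debug information, a gdb backtrace might look something like this (all function names, addresses, arguments, and line numbers hypothetical):

```
(gdb) bt
#0  0x0000000000401226 in list_insert (list=0x0, node=0x6032a0) at list.c:37
#1  0x00000000004012f8 in enqueue_request (req=0x6032a0) at queue.c:52
#2  0x00000000004013f5 in main (argc=2, argv=0x7ffc4a3b2c88) at main.c:24
```

Here the null `list` parameter in frame #0 is immediately suspicious, and frame #1 tells us exactly who passed it.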

The debugger in the above case was able to determine the parameters and line numbers because the program had been compiled with debug symbols enabled. Without this information, a stack trace may be much cruder:

Example: C stack trace with only global symbols
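The same (hypothetical) crash, in a binary compiled without debug information, might yield only the chain of global function names:

```
(gdb) bt
#0  0x0000000000401226 in list_insert ()
#1  0x00000000004012f8 in enqueue_request ()
#2  0x00000000004013f5 in main ()
```

We still learn the sequence of calls, but the parameters and line numbers must be dug out of registers and raw stack memory by hand.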

It is often possible to infer the cause of the error directly from the stack trace. If, for instance, the failure was precipitated by an addressing error, resulting from trusting a bad argument, we can usually see the whole history of where that argument came from, and can often infer the cause of the problem from this information alone. Even if the cause of the corruption is not immediately obvious from the stack trace, knowing what routine we were in, and the sequence of calls that led us here can often tell us a great deal about the sequence of events leading up to the crash.

2.1.2 Contents of Variables

If merely seeing the sequence of calls and parameters is not enough to enable us to infer the cause of a failure, going back and forth between the code and the contents of variables will usually paint a pretty complete picture of what the program thought it was supposed to be doing, who was merely passing parameters along, and who first produced the bogus value. Since the core dump contains all of the data and stack segments of the aborted process, we can usually use a debugger to show us the contents of any global variable, and any local variable within any stack frame. Cruder debuggers may give us only hex dumps, from which we must infer the contents of specific fields. Some debuggers can print out formatted snapshots of even very complex data structures:
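In gdb, for instance, printing a (hypothetical) request structure from one of the stack frames might yield a formatted snapshot like:

```
(gdb) print *req
$1 = {op = OP_WRITE, block = 11274, length = 4096, buffer = 0x6182c0,
      next = 0x617f40, completed = false}
```

Every field is decoded by name and type, so implausible values (a negative length, a pointer into the wrong region) stand out immediately.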

2.1.3 Patterns in the Garbage

One of the worst kinds of problems to track down is stray stores, where somebody picks up a bad pointer (e.g. because of improper variable initialization or clean-up) and stores through it, losing his own data and destroying someone else's in the process. These problems usually turn up millions (or billions) of instructions later when some innocent victim attempts to use the corrupted data and dies because of its inappropriate contents. The victims are easy to find, but what we are really looking for is the psychopath who is storing his values through bad pointers.

In Arlo Guthrie's epic ballad of "Alice's Restaurant", the storyteller, who had illegally dumped a pile of garbage somewhere, was tracked down on the basis of an address found on an envelope at the bottom of the pile. That is often a good metaphor for forensic debugging.

When we say "garbage" we may think of a random combination of bits ... but if you look at the data in any process' address space, there are no "random combinations of bits". Every combination of bits is unique and (to somebody) precious. The bugs that result in such corruption may be randomly distributed through our code, but there is nothing random about the corruption itself:

Studying these characteristics will tell us a great deal about the source of the corruption (which fields of which data structure the stray values appear to be). Knowing what data has been stored (in the wrong place) may suggest a (probably very) small number of places that store those fields, as likely suspects.
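As a toy illustration of reading the garbage (the bytes and field names here are entirely hypothetical), suppose an 8-byte pointer field in a core dump contains an implausible value; reinterpreting the same bytes reveals that somebody stored a file-name string over it:

```python
import struct

# Hypothetical garbage found in an 8-byte pointer slot of a core dump
corrupt = bytes.fromhex("2f746d702f6c6f67")

as_pointer = struct.unpack("<Q", corrupt)[0]           # read it as a little-endian pointer
as_text = corrupt.decode("ascii", errors="replace")    # read the same bytes as text

print(hex(as_pointer))   # 0x676f6c2f706d742f -- not a plausible address
print(as_text)           # /tmp/log -- clearly a path string
```

The "garbage" turns out to be somebody's precious path name, so the likely suspects are the (few) routines that store path strings; the envelope at the bottom of the pile has an address on it.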

2.2 Files

A core dump captures only the current state of a program execution: its stack contents and variables. This is fine if the broken code is still on the stack when the core dump is taken, but if the code that blew up was the innocent victim of an incident that happened long ago, the stack trace may provide very little information about the original cause of the problem. In such situations we need to seek other, less direct, evidence.

Programs often operate on files, which may record the data that drove the program and the output that the program created along the way. The input files may help us to understand what the program was doing shortly before it encountered the problem. The output may help us to understand both what the program was doing and, more precisely, when the problem took place (e.g. the point at which invalid output begins).

2.3 Logs

Numerous programs maintain logs of interesting events. Some log entries may be permanent records of billable events (e.g. phone calls) or records that may be subject to subsequent audit (e.g. who entered which room when). Others are maintained to support service diagnosis (e.g. we log junk e-mail discards to help us track down problems in our filtering rules). Some logs are intended to capture client behavior (e.g. web service requests or file system traffic) for subsequent modeling or analysis. Some logs are maintained specifically to facilitate post-mortem analysis of anomalous behavior. Most major operating systems offer extensive facilities for both the capture and management of logged information. See, for instance:

Most file I/O is buffered (written into a buffer until the buffer is full, at which point it is flushed out to disk). This buffering greatly reduces the expense of logging, but comes at a price. If a process (or the OS) crashes, the last buffer full of data may not have been flushed out to disk, and so the log file may be incomplete. If there is a core dump, however, it should be possible to find the last few log entries still in their in-memory buffer.
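A minimal sketch of the hazard (the file name, buffer size, and entry format are arbitrary): a child process writes log entries through a block-buffered stream and then dies abruptly, and the entries never reach the disk:

```python
import subprocess
import sys
import tempfile

# Child process: write log entries through a block-buffered stream, then "crash".
child = r'''
import os, sys
log = open(sys.argv[1], "w", buffering=8192)   # block-buffered, like stdio
for i in range(5):
    log.write("operation %d completed\n" % i)
os._exit(1)    # abrupt death: no flush, no close
'''

with tempfile.NamedTemporaryFile(suffix=".log", delete=False) as f:
    path = f.name

subprocess.run([sys.executable, "-c", child, path])
lost = open(path).read()
print(repr(lost))    # every entry was still sitting in the unflushed buffer
```

Had the child dumped core instead, those five entries would still be findable in the in-memory buffer within the dump.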

There is always a trade-off to be made in creating diagnostic logs:

One common approach to resolving this tension is to make frequent log entries into a large, in-memory, circular buffer. We might, for instance, dedicate a megabyte of memory to record the last 100,000 operations. When everything is working, we continue to reuse the same log space over and over. If the system ever crashes, we will find a record of the last 100,000 operations waiting for us in the core dump. The cycle and memory costs for such logging are modest ... but the potential benefits are very great. The choice of the size for such a log is critical. If it is too small, important information about the initial cause may have been recycled by the time the core dump is finally triggered.
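A sketch of such a circular log (the capacity and event names are illustrative); a `deque` with a `maxlen` gives us the recycling for free, and a running sequence number tells the post-mortem reader how much history has been discarded:

```python
from collections import deque

class RingLog:
    """Circular in-memory log: retains only the most recent entries."""

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)  # oldest entries recycled automatically
        self.seq = 0                           # total number of events ever logged

    def log(self, event):
        self.entries.append((self.seq, event))
        self.seq += 1

    def dump(self):
        # what a post-mortem examination of the core dump would find
        return list(self.entries)

log = RingLog(capacity=3)
for op in ["open", "read", "read", "write", "close"]:
    log.log(op)

print(log.dump())   # only the last three operations survive, in order
```

In a real (e.g. C) implementation the same idea is a fixed array, a wrapping index, and fixed-size records, so that each entry costs only a few stores.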

3. Inferences, Hypotheses & Confirmations

If we are lucky, it will be obvious why the program died, and the stack-trace, and variable contents (and, of course, the code) will permit us to work our way back from the point of failure to the original error. Once we find the defective code, it is obvious why that code would not have worked in this particular situation. We can see the combination of circumstances that exercised the defective code, and may then be able to recreate the failure at will. That is, if we are lucky.

It is not always so. Sometimes, the above exercise leads us to a non-primary cause:

In all of these cases, the behavior of the failed program was reasonable, considering what we know to have happened. The problem is that we cannot explain how the circumstances that precipitated the failure could have come about. Unfortunately, the code that caused the problem is no longer on the stack, and we are forced to use inference (rather than deduction) to identify it. This process can be extremely difficult, but it can also be a lot of fun and very rewarding:
  1. observe anomalies
  2. formulate hypotheses to explain them
  3. predict other (testable) implications of those hypotheses
  4. make observations to confirm or refute them
  5. integrate the newly gained information into our model
  6. repeat until the problem is solved
When tackling such a foe, there is no substitute for a thorough understanding of how the software in question is supposed to work:

There may be some innate gifts that make some people good at this process, but it clearly requires a great deal of knowledge, and skills that are developed through experience. It was once said that you will know you have arrived as a "Kernel Hacker" when you can correctly diagnose and explain a race condition over the phone.

Often, it may seem that we have exhausted all possible explanations, and that the situation we observe is impossible. At such times it may be valuable to go back to Sherlock Holmes, who oft reminded us:

	"When you have eliminated the impossible, whatever remains, however improbable, must be the truth."

If there are no possible explanations for your symptoms, the most likely answer is that you have assumed something to be impossible that is not. Go back and revisit your basic assumptions, and look for ways to confirm or refute their truth. You will often find that something that you believed to be impossible has indeed taken place. Learning to recognize and release "assumptions about what is possible" is part of the path. You might recall the great Vizzini's inability to "conceive" of the fact that anyone could be able to discern and interfere with his masterful plan, culminating at the top of the Cliffs of Insanity:

An open-minded approach to debugging is much better characterized by Hamlet's advice to Horatio regarding the source of his recent revelations:

	"There are more things in heaven and earth, Horatio, than are dreamt of in your philosophy."

What we want to get out all this reverie and detective work is confirmation of our hypothesis:

Sometimes the circumstances of a bug are extremely difficult to precipitate and/or the consequences too complex to enable clean predictions, but this is the level of certainty for which we should strive.

4. Anticipating Future Needs

A good architecture addresses all of a system's requirements. When we are designing a complex system, we consider the likely modes of failure, and attempt to prevent such problems or to construct firewalls to limit their potential impact. We must also consider future problems and more complex modes of failure, and ask ourselves what information we might need to diagnose them. This consideration may lead us to:

This is another huge advantage we have over the cosmologists. They were not consulted, prior to the creation of the universe, about what information would be useful in unraveling the process. We, on the other hand, have the ability to include any instrumentation we want in our creations. The information that will be available to us when we have to diagnose a failure, after the fact, is almost entirely under our control. This is a tremendous power. Use it every chance you get. You'll be glad you did.