Root Cause Analysis
Mark Kampe
$Id: rootcause.html 150 2007-11-20 01:59:42Z Mark $
You find a bug, you fix it, and you move on to the next bug.
Right? What do you do if you figure out that most of your
bugs are coming from a single person? Wouldn't it be more
efficient to fix the broken person than to continue fixing
their bugs?
Root Cause Analysis
is a critical element of most
continuous improvement methodologies. A few notions
underlie this process:
- process changes are expensive, and we should only
make changes when we have good reason to believe
that the benefits will greatly exceed the costs.
- errors are not random, but stem from specific causes.
- there are chains of causality, and by following them
backwards, we can find common causes to problems that
(initially) seemed to be unrelated.
- the Pareto principle applies to errors, in that 20%
of the causes give rise to 80% of the problems.
- eliminating the source of many errors has
the potential to be much more effective than finding
and eliminating individual error instances.
There are many approaches to root cause analysis. Some
(e.g. Six Sigma)
are rich formal methodologies,
while others
(e.g. 5 Whys)
are simple enough to be driven by a two-year-old.
Most of them, however, involve:
- statistical analysis to identify clusters of related
problems or attributes that are shared by large numbers
of problems.
- investigation of representative instances (by domain experts)
to follow the chain of causality back to a root cause.
- additional studies to confirm that this root cause did indeed
significantly contribute to a great many problems.
- identification of changes to eliminate or control this root cause.
A root cause might be in our process, our materials, the training
of our people, the type of product we are building, the way we
identify potential customers, or any other aspect of our operation.
Root cause analysis does not presume where the roots of the problem
will be found.
We might, for example, start with the observation that we are
having a great many security penetration incidents.
- A study of these incidents might reveal that 99% were
achieved through privileged network daemons.
- A more detailed study of these incidents might reveal
that only 5% were classified as design problems, and
that over 85% were classified as coding problems.
- A review of the code found that 95% of the incidents
characterized as coding problems involved overflows of
local (on the stack) buffers.
- The stack is the most popular target because overflowing an
array on the stack permits an attacker to change the return
address and thus to execute arbitrary code.
- People put input buffers on the stack because it is easier
than doing dynamic memory allocation.
- Most of the overflows result from reading until a delimiter
is found, without regard to the accumulated length.
- The programmers made these mistakes because they were never
told to watch out for them. The code passed review for the
same reason.
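The coding pattern at the root of those overflows, and a bounds-checked alternative, can be sketched in C (the function names and delimiter are illustrative, not taken from any particular code base):

```c
#include <stddef.h>

/* UNSAFE: copies from src into buf until the ':' delimiter, with no
 * regard for buf's capacity. If buf is a local array and the input
 * is long, the copy runs past the buffer and can overwrite the
 * saved return address on the stack. */
void read_token_unsafe(const char *src, char *buf) {
    while (*src && *src != ':')
        *buf++ = *src++;        /* no bounds check */
    *buf = '\0';
}

/* SAFER: the caller passes the buffer's capacity, and accumulation
 * stops before the buffer is exhausted. Returns the token length,
 * or -1 if the input would not fit. */
int read_token_safe(const char *src, char *buf, size_t size) {
    size_t i = 0;
    while (*src && *src != ':') {
        if (i + 1 >= size)
            return -1;          /* too long: fail rather than overflow */
        buf[i++] = *src++;
    }
    buf[i] = '\0';
    return (int)i;
}
```

The unsafe version is the shorter and more "natural" one to write, which is exactly why, absent training or review checklists, it keeps reappearing.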
We might consider the last few of these to be root causes of hundreds of
penetration bugs. Once we have identified a few causes that explain
a large number of problems, we can set about attacking those causes:
- we could prototype an auditing
utility to identify code that puts arrays on the stack and
does not appear to have adequate bounds checking.
- we could study the offending code and attempt to define a
new library package providing more convenient buffer
access functions that (a) place data in dynamically allocated
memory and (b) perform automatic bounds checking.
- we could add bounds checking to our mandatory issues list
for all code reviews for privileged programs and network daemons.
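One possible shape for such a library, with names and API entirely hypothetical (a sketch of the idea, not an existing package), is a growable buffer that lives in dynamically allocated memory and checks bounds on every append:

```c
#include <stdlib.h>

/* Hypothetical growable buffer: data lives on the heap, and every
 * append is bounds-checked, growing the allocation instead of
 * overflowing it. */
struct dbuf {
    char  *data;
    size_t len;    /* bytes used (excluding the terminating NUL) */
    size_t cap;    /* bytes allocated */
};

int dbuf_init(struct dbuf *b) {
    b->cap = 16;
    b->len = 0;
    b->data = malloc(b->cap);
    if (!b->data)
        return -1;
    b->data[0] = '\0';
    return 0;
}

/* Append one byte, doubling the allocation when it fills; the
 * caller cannot overflow, because the buffer grows instead. */
int dbuf_putc(struct dbuf *b, char c) {
    if (b->len + 1 >= b->cap) {
        size_t ncap = b->cap * 2;
        char *p = realloc(b->data, ncap);
        if (!p)
            return -1;
        b->data = p;
        b->cap = ncap;
    }
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
    return 0;
}

void dbuf_free(struct dbuf *b) {
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

A reader loop built on `dbuf_putc` can accumulate a delimited token of any length without ever touching a fixed-size stack array, which removes the root cause rather than patching individual instances of it.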
Sometimes, as Sigmund Freud is said to have conceded,
a cigar is just a cigar. We should, however, stop periodically,
look at what we are doing, and attempt to ascertain whether some of the
problems we are fighting might more effectively be addressed closer to the
source.