Root Cause Analysis
Mark Kampe
$Id: rootcause.html 150 2007-11-20 01:59:42Z Mark $
You find a bug, you fix it, and you move on to the next bug.
Right? What do you do if you figure out that most of your
bugs are coming from a single person? Wouldn't it be more
efficient to fix the broken person than to continue fixing
their bugs?
Root Cause Analysis
is a critical element of most
continuous improvement methodologies. A few notions
underlie this process:
- process changes are expensive, and we should only
make changes when we have good reason to believe
that the benefits will greatly exceed the costs.
- errors are not random, but stem from specific causes.
- there are chains of causality, and by following them
backwards, we can find common causes to problems that
(initially) seemed to be unrelated.
- the Pareto principle applies to errors, in that 20%
of the causes give rise to 80% of the problems.
- eliminating the source of many errors has
the potential to be much more effective than finding
and eliminating individual error instances.
There are many approaches to root cause analysis. Some
(e.g. Six Sigma)
are rich formal methodologies,
while others
(e.g. 5 Whys)
are simple enough to be driven by a two-year-old.
Most of them, however, involve:
- statistical analysis to identify clusters of related
problems or attributes that are shared by large numbers
of problems.
- investigation of representative instances (by domain experts)
to follow the chain of causality back to a root cause.
- additional studies to confirm that this root cause did indeed
significantly contribute to a great many problems.
- identification of changes to eliminate or control this root cause.
A root cause might be in our process, our materials, the training
of our people, the type of product we are building, the way we
identify potential customers, or any other aspect of our operation.
Root cause analysis does not presume where the roots of the problem
will be found.
We might, for example, start with the observation that we are
having a great many security penetration incidents.
- A study of these incidents might reveal that 99% were
achieved through privileged network daemons.
- A more detailed study of these incidents might reveal
that only 5% were classified as design problems, and
that over 85% were classified as coding problems.
- A review of the code found that 95% of the incidents
characterized as coding problems involved overflows of
local (on the stack) buffers.
- The stack is the most popular target because overflowing an
array on the stack permits an attacker to change the return
address and thus to execute arbitrary code.
- People put input buffers on the stack because it is easier
than doing dynamic memory allocation.
- Most of the overflows result from reading until a delimiter
is found, without regard to the accumulated length.
- The programmers made these mistakes because they were never
told to watch out for them. The code passed review for the
same reason.
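The coding pattern at the root of those overflows, and a bounds-checked alternative, can be sketched in C (the function names and delimiter are illustrative, not taken from any particular code base):

```c
#include <stddef.h>

/* UNSAFE: copies from src into buf until the ':' delimiter, with no
 * regard for buf's capacity. If buf is a local array and the input
 * is long, the copy runs past the buffer and can overwrite the
 * saved return address on the stack. */
void read_token_unsafe(const char *src, char *buf) {
    while (*src && *src != ':')
        *buf++ = *src++;        /* no bounds check */
    *buf = '\0';
}

/* SAFER: the caller passes the buffer's capacity, and accumulation
 * stops before the buffer is exhausted. Returns the token length,
 * or -1 if the input would not fit. */
int read_token_safe(const char *src, char *buf, size_t size) {
    size_t i = 0;
    while (*src && *src != ':') {
        if (i + 1 >= size)
            return -1;          /* too long: fail rather than overflow */
        buf[i++] = *src++;
    }
    buf[i] = '\0';
    return (int)i;
}
```

The unsafe version is the shorter and more "natural" one to write, which is exactly why, absent training or review checklists, it keeps reappearing.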
We might consider the last few of these to be root causes of hundreds of
penetration bugs. Once we have identified a few causes that explain
a large number of problems, we can set about attacking those causes:
- we could prototype an auditing
utility to identify code that puts arrays on the stack and
does not appear to have adequate bounds checking.
- we could study the offending code and attempt to define a
new library package providing more convenient buffer
access functions that (a) place data in dynamically allocated
memory and (b) perform automatic bounds checking.
- we could add bounds checking to our mandatory issues list
for all code reviews for privileged programs and network daemons.
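One possible shape for such a library, with names and API entirely hypothetical (a sketch of the idea, not an existing package), is a growable buffer that lives in dynamically allocated memory and checks bounds on every append:

```c
#include <stdlib.h>

/* Hypothetical growable buffer: data lives on the heap, and every
 * append is bounds-checked, growing the allocation instead of
 * overflowing it. */
struct dbuf {
    char  *data;
    size_t len;    /* bytes used (excluding the terminating NUL) */
    size_t cap;    /* bytes allocated */
};

int dbuf_init(struct dbuf *b) {
    b->cap = 16;
    b->len = 0;
    b->data = malloc(b->cap);
    if (!b->data)
        return -1;
    b->data[0] = '\0';
    return 0;
}

/* Append one byte, doubling the allocation when it fills; the
 * caller cannot overflow, because the buffer grows instead. */
int dbuf_putc(struct dbuf *b, char c) {
    if (b->len + 1 >= b->cap) {
        size_t ncap = b->cap * 2;
        char *p = realloc(b->data, ncap);
        if (!p)
            return -1;
        b->data = p;
        b->cap = ncap;
    }
    b->data[b->len++] = c;
    b->data[b->len] = '\0';
    return 0;
}

void dbuf_free(struct dbuf *b) {
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

A reader loop built on `dbuf_putc` can accumulate a delimited token of any length without ever touching a fixed-size stack array, which removes the root cause rather than patching individual instances of it.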
Sometimes, as Sigmund Freud is said to have conceded,
a cigar is just a cigar. We should, however, stop periodically,
look at what we are doing, and attempt to ascertain whether some of the
problems we are fighting might more effectively be addressed closer to the
source.