Root Cause Analysis

Mark Kampe

A man was walking by the side of a river and saw an out-of-breath woman swimming out to rescue a baby floating in the river. There were a half-dozen babies already on the shore, and she screamed for him to help her save them. He turned and started running upstream. She shouted "You must help me save these babies!" He shouted back "I'm going to go find the person who is throwing them in the river!"

You find a bug, you fix it, and you move on to the next bug. Right? What do you do if you figure out that most of your bugs are coming from a single person or process? Wouldn't it be more efficient to fix the broken person or process than to continue fixing their bugs? Tolstoy may have been right about every unhappy family being unique, but it is regularly observed that a great many bugs seem to be traceable to a few common causes. There are many such lists, but a few common ones are:

It is important to realize that these are not random mistakes resulting from cosmic-ray hits to programmer neurons. These are systematic mistakes resulting from incomplete understanding and/or unsound processes. McConnell's "Code Complete" is an effort to address many of the most common and troublesome of these problems. But even if we fully mastered all of these lessons, we would merely go on to find new mistakes to make.
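One way to check whether a few causes really do account for most of the bugs is to tally them. The following is a minimal sketch, assuming each bug has already been given a cause label during triage. The records, the cause labels, the function name `vital_few`, and the 80% threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical bug records: (bug id, cause label assigned during triage).
BUGS = [
    (101, "unchecked input"),
    (102, "unchecked input"),
    (103, "race condition"),
    (104, "unchecked input"),
    (105, "off-by-one"),
    (106, "unchecked input"),
    (107, "race condition"),
    (108, "unclear requirement"),
    (109, "unchecked input"),
    (110, "race condition"),
]


def vital_few(records, threshold=0.8):
    """Return the most frequent causes that together cover `threshold` of the records.

    Each result entry is (cause, count, cumulative fraction of all records).
    """
    counts = Counter(cause for _, cause in records)
    total = sum(counts.values())
    covered = 0
    result = []
    # Walk the causes from most to least frequent, stopping once enough
    # of the bugs are explained.
    for cause, n in counts.most_common():
        covered += n
        result.append((cause, n, covered / total))
        if covered / total >= threshold:
            break
    return result


if __name__ == "__main__":
    for cause, n, cumulative in vital_few(BUGS):
        print(f"{cause}: {n} bugs, {cumulative:.0%} cumulative")
```

If a short list of causes covers most of the records, those causes are the natural candidates for deeper analysis. If the counts are spread evenly, the labels may be too fine-grained or the problems may really be unrelated.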

Root Cause Analysis is a critical element of most continuous improvement methodologies. A few notions underlie this process:

There are many approaches to root cause analysis. Some (e.g. Six Sigma) are rich formal methodologies, while others (e.g. 5 Whys) are simple enough to be driven by a two-year-old. Most of them, however, involve:

A root cause might be in our process, our materials, the training of our people, the type of product we are building, the way we identify potential customers, or any other aspect of our operation. Root cause analysis does not presume where the roots of the problem will be found.

We might, for example, start with the observation that we are having a great many security penetration incidents.

We might consider the last few of these to be root causes of hundreds of penetration bugs. Once we have identified a few causes that explain a large number of problems, we can set about attacking those causes:

Sometimes, as Sigmund Freud is said to have conceded, a cigar is just a cigar. We should, however, stop periodically, look at what we are doing, and attempt to ascertain whether some of the problems we are fighting might more effectively be addressed closer to the source.