Testing and Bug Discovery

Mark Kampe
$Id: testbugs.html 7 2007-08-26 19:52:08Z Mark $

1. Introduction

There are (at least) two major differences between unit testing and system testing:

It is not particularly difficult to define "necessary conditions" for whether or not a program is "good enough". Our system requirements and specifications give us a rich set of properties that the final system must have. Is the sum of these enumerated requirements a sufficient condition for shipability? Unfortunately, testing cannot show the absence of bugs, and the functionality and behavior of many programs is sufficiently rich that we cannot trust a simple list of test cases to adequately measure the product's quality.

These are important questions, and many interesting papers have been (and continue to be) written on this subject. This paper will present an introduction to approaches for determining how buggy a product is, and then briefly discuss the nuanced relationship between testing and bug discovery. It is hoped that a better understanding of this relationship will enable you to plan testing activities more effectively, to estimate their efficacy, and to assess their results more accurately.

2. Ship Criteria

If merely having completed all specified test cases cannot be trusted as a sufficient condition for ship-worthiness, what additional criteria must we add? It is common for ship criteria to include additional (more subjective) indicators of product quality. These often include things like:

Test results, combined with additional experiential criteria (such as the above), can give us much greater confidence about the quality of our product. The first three criteria listed above are goals that can be definitively met. The problem with the last two is that they could be satisfied on Monday morning, but no longer hold true a few hours later. We can easily specify criteria based on the number of open bug reports. It is not possible to specify such criteria for the number of bugs in the product ... because we do not know how many bugs we have yet to find.

Since we cannot actually measure the bugs that we have not found, we need some surrogate that we can measure and that we believe to be well correlated with the number of remaining (undiscovered) bugs. Many such surrogates have been proposed:

3. Bug Discovery Rates

Why does this curve seem to have such a predictable shape, and what determines its parameters? This is an interesting question, about which there is still much debate. There are many popular (at least among geeks like me) formulae to describe the efficacy of a testing regimen. A typical one is:

    M(t) = A * K * C(t)

where
  • M(t) is the number of defects (M), discovered by time (t).
  • A is the number of actual defects in the code.
  • C(t) is a coverage function, measuring what fraction of the testable space we will have covered by time (t).
  • K is the efficacy of the testing technique in question (what fraction of the bugs present it is expected to find).
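
    As a sketch of how this formula behaves, here is a minimal model in Python. M(t), A, K, and C(t) are as defined above; the particular coverage function (an exponential approach to full coverage) and all of the numbers are illustrative assumptions, not anything measured:

```python
import math

def coverage(t, rate=0.5):
    # Assumed coverage function: C(t) = 1 - e^(-rate * t), i.e. each
    # additional unit of testing covers a fixed fraction of what remains.
    return 1.0 - math.exp(-rate * t)

def discovered(A, K, c):
    # M(t) = A * K * C(t): expected defects discovered once coverage is c.
    return A * K * c

# 100 actual defects, a technique expected to find 80% of what it covers:
for t in (1, 4, 16):
    print(t, round(discovered(100, 0.8, coverage(t)), 1))   # 31.5, 69.2, 80.0
```

    Note how discovery saturates at A * K = 80: even unlimited testing of this one type cannot be expected to find the remaining 20 defects.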

    We will refer back to this formula as we try to understand what factors influence bug discovery rate.

    I believe that it is important to look at this curve on two scales: bug discovery under a particular test suite, and bug discovery over the entire release cycle.

    The same curve seems to apply equally well on both scales (which is why the distinction is so seldom made), but the underlying causes are quite different ... and those differences have very practical implications.

    3.1 Bug Discovery Under a Particular Test Suite

    One might think that we would run a test suite, find a bunch of bugs, and have found them all. At time t, we run our test suite and we find A * K bugs (K% of the A bugs that are actually present). But it turns out that A (the number of bugs present) is not a constant, and C(t) (our coverage function) is not a spike. There are several factors that lead to the observed distribution:

    1. Ongoing test-case development.
      In new products, the test cases are being written in parallel with the code being tested. Not all of the test cases are available on day one. The increase in the rate of bug discovery often parallels the rate at which new test cases are brought on-line.
    2. Ongoing product development.
      Very few software products are delivered, complete, into testing. Testing often begins as soon as there is anything to test (to exercise the processes, to test what is available, and to get started finding and fixing problems ASAP). The increasing rate of new bug discovery also often parallels the rate of code delivery.
    3. Blocked test-cases.
      In the early stages of testing, the software may be so failure-prone that some bugs prevent the execution of some test cases (e.g. by causing the program to fail before that test case can be attempted). The rate at which test cases can be run may be limited by the rate at which (already discovered) bugs can be fixed.
    4. Difficult problem isolation
      Bug reports are not filed immediately after a test case fails. The failure must be investigated to ensure that it is indeed a failure of the tested program and not an error in the test case. Depending on priorities and workloads, this investigation process can spread out over weeks and months ... again limiting the rate at which bug reports can be filed.
    5. Regression
      Not all bug fixes are perfect, and some (regrettably, often large) fraction of bug fixes will, themselves, be faulty. This means that new bugs will continue to be introduced into the product (and thus become discoverable) even after development is nominally complete.
    6. Non-deterministic failures
      It is tempting to think of a test suite as a simple sequence of well scripted test cases:
      • prepare the inputs
      • invoke the operation
      • check the results against the specifications
      For such a test suite (e.g. functionality, error handling, regression testing), running it once should uncover all of the bugs that it will ever find. There are other types of tests (e.g. random scenarios, load and stress testing) where this is not at all the case. For these (non-deterministic) tests, running them longer yields greater confidence, which produces the exponential decay in the bug report rate as testing continues.
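
    The contrast between scripted and non-deterministic tests can be sketched in a few lines of Python. Everything here (the add function under test, the input ranges) is hypothetical, purely for illustration:

```python
import random

def add(a, b):
    # Hypothetical function under test.
    return a + b

def scripted_test():
    # Deterministic: prepare the inputs, invoke the operation, check the
    # results against the specification. A second run of this exact test
    # can never find a bug that the first run missed.
    inputs = (2, 3)
    result = add(*inputs)
    assert result == 5

def random_test(trials, seed=None):
    # Non-deterministic: each run samples different points in the input
    # space, so running longer buys more coverage (and more confidence).
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randint(-10**6, 10**6), rng.randint(-10**6, 10**6)
        assert add(a, b) == b + a   # a spec invariant: commutativity

scripted_test()
random_test(1000)
```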

    3.2 Bug Discovery and the Release Cycle

    The previous discussion was limited to bug discovery resulting from the execution of a single test suite. If we now zoom out to view the entire process from development through release, we see a very similar distribution of bug reports ... but for very different reasons.

    Products are "tested" in a great many different ways:

    Each of these "testing" techniques is very different from the others, and is likely to find very different problems. Returning to our defect discovery equation, each of these techniques has a different coverage function (C(t)) and efficacy (K). Moreover, each of these testing phases (especially system testing) may itself comprise many different types of tests, each of which has its own coverage and efficacy functions.

    The "whole product" bug discovery curve is actually the sum of these tributary curves. Each of the tributary curves tends to follow a similar distribution (with different parameters) for the reasons described in the previous section. It might not, however, be immediately obvious why the sum of a series of such curves should have the same distribution:

    1. If you look carefully, they don't.
      The bug arrival rate curves for real products do experience periodic bumps (associated with new testing activities), but each bump is still followed by an exponential decay.
    2. There are, however, other factors that tend to replicate a similar distribution on the larger scales.
      • In the earliest stages of the product, there is no code and so there is little basis on which bug reports could be filed.
      • As code becomes available, and test cases are run, the bug arrival rate ramps up.
      • Each successive testing activity removes more defects, and so the number of residual defects does experience an exponential decay.
      • The deterministic test cases are defined and run relatively early in the process. They play a very significant role in the initial ramp-up of bug-discovery, but quickly cease to be a source of new bugs.
      • After significant testing and use, the product quickly reaches the point where most of the remaining bugs are:
        • in untested and infrequently used operations
        • in unlikely error conditions
        • in unlikely timing and interaction problems
        And the detection rate for these bugs should indeed decay exponentially (as increasing exposure gives asymptotically better confidence).
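
    The resulting overall decay can be illustrated with a toy calculation. If each successive testing activity finds some fraction K of the defects still present, the residual pool shrinks geometrically. The phase names, efficacies, and initial defect count below are invented for illustration:

```python
# Assumed phases and per-phase efficacies (K): purely illustrative numbers.
phases = [("code review", 0.30), ("unit test", 0.40),
          ("system test", 0.50), ("beta use", 0.25)]

residual = 1000.0   # assumed number of defects initially present
for name, k in phases:
    found = residual * k
    residual -= found
    print(f"{name:12s} finds {found:6.1f}, leaving {residual:6.1f}")
```

    Each phase contributes its own ramp-and-decay "bump" to the bug arrival curve, but because every phase starts from a smaller residual pool than the last, the sum of those bumps still trends downward.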

    4. Testing and Bug Discovery

    Different types of testing find different types of bugs. The bug discovery rate equation attempts to capture this with its efficacy constant (K).

    Thus, our expectation of how many (new) bugs we will find when we begin the next phase of testing depends on how similar the new phase of testing is to testing that has already been done. If we are to begin exercising the system in significantly different ways, we should not be surprised if we experience a significant increase in the number of bugs we discover. If we want to find more bugs than we have already found, we are going to need to do a different type of testing.

    Different types of testing require different amounts of time. The bug discovery rate equation attempts to capture this with its coverage function (C(t)):

    Understanding this should give us some basis for deciding how long we need to run a particular type of test before we have "squeezed the juice out of it". Running a functional validation suite a second time might be a waste of cycles, whereas running a load-stress suite for a second billion transactions might add considerable value.
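
    A back-of-the-envelope way to ask whether a test type has been "squeezed dry" is to look at the marginal coverage a further run buys. This sketch reuses M(t) = A * K * C(t); the coverage functions and all numbers are assumptions:

```python
import math

def new_bugs(A, K, c_before, c_after):
    # Expected *new* bugs from extending coverage from c_before to c_after.
    return A * K * (c_after - c_before)

# A deterministic functional suite: one pass covers everything it will
# ever cover, so a second pass buys no new coverage and no new bugs.
print(new_bugs(100, 0.8, 1.0, 1.0))          # 0.0

# A load/stress suite with assumed coverage C(t) = 1 - e^(-t/T): a second
# billion transactions still moves coverage, so it still yields bugs.
T = 2.0
c1 = 1.0 - math.exp(-1.0 / T)   # coverage after the first run
c2 = 1.0 - math.exp(-2.0 / T)   # coverage after the second run
print(round(new_bugs(100, 0.5, c1, c2), 1))  # 11.9
```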

    Don't make the mistake of blindly assuming that the number of test cases that have been run, or the number of hours that a system has been tested, can be used to predict its quality. Ask instead: how efficacious are these test cases likely to be at finding the bugs we think we are likely to have, and how much coverage do we think our hours of testing have bought us?

    5. Testing and Confidence

    Insight into the shape of real bug discovery curves is a good thing, but what we really need is to be able to predict the discovery rates for future products. More specifically, we want to know how well we can use experience gained in unit and system testing to predict the rates at which customers will encounter problems.

    I suggest that it is misleading to think about software in terms of an abstract number of residual defects (bugs not yet found):

    1. the shape of the defect discovery curve is only a general trend, and the results of extrapolating such a curve are imprecise.
    2. even if we knew how many residual defects were in the code, this would not enable us to predict the rate at which customers would encounter them or the severity of the resulting errors.
    Rather, I think it is more useful to think about quality assessment in terms of confidence. How likely is it that our current estimate of the product quality will match what the customers actually experience? If we believe that customers will use the system in ways that are very similar to the ways that we have tested the system, then we should expect the customer experience to be an extrapolation of the current system testing results. If, however, we believe our testing methodology to be artificial (i.e. not particularly representative of the way that customers will use the product), then we should expect a significant increase in the number of reported problems when the product goes out to real users.

    If we do not believe (or are not confident) that our testing is representative of (or at least well correlated with) the way the product will actually be used:

    So how are we supposed to know how representative our testing is of the way that customers will use our software?

    As with so many things, Sun Tzu clearly understood this when he advised us: