If we are to improve the reliability and robustness of our software, we must be able to discuss these issues intelligently, and to distinguish between distinct but related concepts. The following is a list of basic terms in reliability engineering. These terms are often misused, even by experienced engineers. It is important that you understand the definitions of these terms and be able to use them correctly. This will enable you to think and communicate more effectively.
In software, defects are most commonly referred to as bugs. There is nothing wrong with calling defects bugs, but make certain that you do not use this term to refer to faults, errors, or failures.
It should also be noted that Grace Hopper's original bug was not a defect but rather a fault.
A fault is an incident in which a defect is exercised, causing a component to malfunction (not function as intended). Depending on the nature of the underlying defect, the incident may be precipitated by normal use, unusual use, external events, or a complex combination of such factors.
A defect is merely a predisposition to error. If the defect is never put to the test, the component could go its entire lifetime without ever malfunctioning.
Note that the use of the word fault in general failure terminology is awkwardly similar (but not identical) to its use in computing: the process whereby a computer detects a malfunction or execution failure and notifies software of it.
An error is an incident in which a component malfunctions (produces unexpected output). This is usually the result of a fault occurring in a defective component.
The term "unexpected" is relative. One might consider throwing a zero-divide exception to be an "unexpected" result during the evaluation of an arithmetic expression, even though the computer instruction set and the language both specify this behavior in this situation. Here, a defect in a program caused a fault in its execution, which was detected by the CPU and reflected (back to the program) as an error.
A failure is an incident in which a system fails to provide services as expected.
The relationship between errors and failures is a nuanced one:
Thus, in the zero-divide example, throwing an exception does not represent a failure of the CPU, compiler, or operating system. All are providing service exactly as they were specified to do.
Robust systems have the ability to recover from errors. If the program that experienced the zero-divide exception caught it, sent back an error message complaining of invalid input, and then continued correctly processing new requests, it would have experienced an error but not a failure.
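This distinction can be sketched in a few lines of Python (the function, request values, and messages are illustrative, not part of any particular system):

```python
def evaluate(expression):
    """Evaluate a division request; a zero divide is an error, not a failure."""
    try:
        numerator, denominator = expression
        return numerator / denominator        # defect exercised -> fault -> error
    except ZeroDivisionError:
        return "error: invalid input"         # error detected and reported

# The service keeps running: one request produced an error, but no failure.
print(evaluate((6, 3)))
print(evaluate((6, 0)))
print(evaluate((8, 2)))
```

The error is confined to the one request that triggered it; subsequent requests are still served correctly, so the system as a whole has not failed.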
Robust systems may have the ability to repair failed components, to switch operation over to spare components, or to continue functioning with reduced capacity. If a web server fails and a front-end switch detects this (e.g. by noting that it took more than five seconds to respond to the last request), the switch could retransmit the request to a different server. Even though there was a complete failure of a key resource, the user might not even experience an error ... but merely a briefly delayed response.
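A minimal sketch of that failover logic, with simulated servers (the server names, request string, and timeout behavior are all assumptions made for illustration):

```python
# The "switch" retries a request on a different server when one fails to respond.
def failed_server(request):
    raise TimeoutError("no response within 5 seconds")   # simulated dead server

def healthy_server(request):
    return f"response to {request}"                      # simulated working server

def dispatch(request, servers):
    """Try each server in turn; mask individual failures from the user."""
    for server in servers:
        try:
            return server(request)          # first server that responds wins
        except TimeoutError:
            continue                        # failure detected: fail over
    raise RuntimeError("all servers failed")  # only now would the user see a failure

print(dispatch("GET /index.html", [failed_server, healthy_server]))
```

The first server fails completely, yet the user sees only a slightly delayed (but correct) response; a failure is surfaced only if every server is down.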
Correctness is the degree to which a component is free of defects.
Robustness is the ability of a component, or system of components, to avoid failure in the face of faults and errors.
This is not at all the same as correctness, but rather is a complementary property. A relatively correct system may have very few defects. A robust system may have defects, be subject to faults, and experience errors ... but still avoid failure through some combination of detection, correction, repair, and redundancy.
Many engineers believe robustness to be both more useful and more achievable than correctness. This is, in part, due to the difficulty of achieving perfect correctness, but in greater part due to the fact that a robust system is more likely to survive in the face of unanticipated errors.
Reliability is the likelihood that a component or system will not fail during a specified period of time. It is typically quantified in one of three ways:
For components with a rated life, that may be the implied time period. Thus, if a light bulb with a rated 10,000-hour life had a 1.5% chance of burning out in the first 10,000 hours, we would say that its reliability was 98.5% (over its rated life).
For components that do not experience normal aging, a specific time period can be specified. My Windows 98 system had about a 50% chance of staying up for one day, but my Windows XP system seems to have a 90% chance of staying up for a week or more.
For components that are expected to fail during their lifetimes, we can observe the number of failures over a long period of operation, divide the total operating time by that number, and obtain a statistical average failure interval (a Mean Time To Failure).
My Windows 98 system had an MTTF of 4 hours. My XP system has an MTTF in excess of 150 hours.
For components that are not expected to fail during their lifetimes, we can extrapolate the probability of failure to a very large population (or time period) and look at the number of expected failures per billion hours of operation (FITs).
The soft error rate associated with cosmic ray hits to semiconductor memory is on the order of 100 FITs/megabit.
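To make FIT figures concrete, here is a small sketch; the 100 FITs/megabit rate comes from the text, while the 1 GB memory size is an assumed example:

```python
FITS_PER_MEGABIT = 100      # soft error rate cited in the text
BILLION_HOURS = 1e9         # 1 FIT = 1 failure per billion hours

def mttf_hours(megabits):
    """Mean time to a soft error for a memory of the given size."""
    failures_per_billion_hours = FITS_PER_MEGABIT * megabits
    return BILLION_HOURS / failures_per_billion_hours

# Assumed example: 1 GB of memory = 8192 megabits
print(round(mttf_hours(8192)))   # roughly 1221 hours, i.e. about 50 days
```

Note how a rate that sounds negligible per megabit translates into a soft error every few weeks once it is scaled up to a full memory system.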
Assuming standard distributions, these three quantifications should all be interchangeable (in that any could be computed from the others). In the last 10-20 years, however, availability analysts have started moving away from the most venerable of these forms (MTTF):
For components that are not expected to fail during their rated lifetimes, MTTF, while statistically meaningful, is highly misleading.
For components that are likely to experience failure during their rated lives, but can be repaired (restored to service after a failure), it is meaningful to talk about the likelihood that a component or system will be providing service at any particular instant (in steady-state operation over a long period of time). The availability of a system is a function of both its failures and its repairs.
The quantification of failures was described above (under reliability). Repair can also be quantified either as a mean time (Mean Time To Repair) or as a rate (e.g. number of repairs per billion hours).
Availability is typically quantified in one of two ways:
In most cases the Mean Time Between Failures (MTBF) is equal to the MTTF. There are, however, situations where the Mean Time to first Failure is very large, but the Mean Time to subsequent Failures is much smaller. In such cases, the steady-state MTBF would be equal to the Mean Time to subsequent Failures.
Av = expected up-time / (expected up-time + expected down-time)
which can be approximated as
Av ≈ MTBF / (MTBF + MTTR)
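Plugging illustrative numbers into this approximation (the MTBF and MTTR values below are invented for the sketch, echoing the rarely-failing-but-slow-to-repair versus often-failing-but-quick-to-recover comparison made later in the text):

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, approximated as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# System A: fails once per 1000 hours, takes 30 minutes to repair.
# System B: fails once per 100 hours, recovers in 10 seconds.
a = availability(1000.0, 0.5)
b = availability(100.0, 10 / 3600)
print(f"A: {a:.6f}  B: {b:.6f}")   # B is more available despite failing 10x as often
```

Fast recovery can more than compensate for a higher failure rate, which is why availability analysis must consider repair time and not just MTBF.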
Availability is also (colloquially) expressed in nines:
| availability | nines | annual down time |
|---|---|---|
| 99% | 2 | 3.6 days |
| 99.9% | 3 | 8.8 hours |
| 99.99% | 4 | 52 minutes |
| 99.999% | 5 | 5 minutes |
| 99.9999% | 6 | 31 seconds |
| 99.99999% | 7 | 3 seconds* |
*in your dreams!
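The down-time column in the table follows directly from the availability figure; a quick check in Python:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_downtime_seconds(nines):
    """Expected annual down time for an availability of n nines."""
    unavailability = 10 ** -nines       # e.g. 3 nines -> 0.001 unavailability
    return SECONDS_PER_YEAR * unavailability

for n in range(2, 8):
    print(n, "nines:", round(annual_downtime_seconds(n)), "seconds/year")
```

Each additional nine cuts the expected annual down time by a factor of ten, from days at two nines to a few seconds at seven.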
Availability is affected by reliability (in that more failures imply less availability), but also incorporates the expected repair or recovery time (Mean Time To Repair). If system A fails one tenth as often as system B, but system B recovers in seconds, whereas system A takes many minutes to repair, system B could have much higher availability, despite its lower reliability.