If we are to improve the reliability and robustness of our software, we must be able to discuss these issues intelligently, and to distinguish between distinct but related concepts. The following is a list of basic terms in reliability engineering. These terms are often misused, even by experienced engineers. It is important that you understand the definitions of these terms and be able to use them correctly. This will enable you to think and communicate more effectively.
In software, defects are most commonly referred to as bugs. There is nothing wrong with calling defects bugs, but make certain that you do not use this term to refer to faults, errors, or failures.
It should also be noted that Grace Hopper's original bug was not a defect but rather a fault.
A fault is an incident in which a defect is exercised, causing a component to malfunction (not function as intended). Depending on the nature of the underlying defect, the incident may be precipitated by normal use, unusual use, external events, or a complex combination of such factors.
A defect is merely a predisposition to error. If the defect is never put to the test, the component could go its entire lifetime without ever malfunctioning.
Note that the use of the word fault in general failure terminology is awkwardly similar (but not identical) to its use in computing: the process whereby a computer detects a malfunction or execution failure and notifies software of it.
An error is an incident in which a component malfunctions (produces unexpected output). This is usually the result of a fault occurring in a defective component.
The term "unexpected" is relative. One might consider throwing a zero-divide exception to be an "unexpected" result during the evaluation of an arithmetic expression, even though the computer instruction set and the language both specify this behavior in this situation. Here, a defect in a program caused a fault in its execution, which was detected by the CPU and reflected (back to the program) as an error.
A failure is an incident in which a system fails to provide services as expected.
The relationship between errors and failures is a nuanced one:
Thus, in the zero-divide example, throwing an exception does not represent a failure of the CPU, compiler, or operating system. All are providing service exactly as they were specified to do.
Robust systems have the ability to recover from errors. If the program that experienced the zero-divide exception caught it, sent back an error message complaining of invalid input, and then continued correctly processing new requests, it would have experienced an error but not a failure.
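This distinction can be sketched in a few lines of Python (the function, request values, and messages are illustrative, not part of any particular system):

```python
def evaluate(expression):
    """Evaluate a division request; a zero divide is an error, not a failure."""
    try:
        numerator, denominator = expression
        return numerator / denominator        # defect exercised -> fault -> error
    except ZeroDivisionError:
        return "error: invalid input"         # error detected and reported

# The service keeps running: one request produced an error, but no failure.
print(evaluate((6, 3)))
print(evaluate((6, 0)))
print(evaluate((8, 2)))
```

The error is confined to the one request that triggered it; subsequent requests are still served correctly, so the system as a whole has not failed.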
Robust systems may have the ability to repair failed components, to switch operation over to spare components, or to continue functioning with reduced capacity. If a web server fails and a front-end switch detects this (e.g. by noting that it took more than five seconds to respond to the last request), the switch could retransmit the request to a different server. Even though there was a complete failure of a key resource, the user might not even experience an error ... but merely a briefly delayed response.
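A minimal sketch of that failover logic, with simulated servers (the server names, request string, and timeout behavior are all assumptions made for illustration):

```python
# The "switch" retries a request on a different server when one fails to respond.
def failed_server(request):
    raise TimeoutError("no response within 5 seconds")   # simulated dead server

def healthy_server(request):
    return f"response to {request}"                      # simulated working server

def dispatch(request, servers):
    """Try each server in turn; mask individual failures from the user."""
    for server in servers:
        try:
            return server(request)          # first server that responds wins
        except TimeoutError:
            continue                        # failure detected: fail over
    raise RuntimeError("all servers failed")  # only now would the user see a failure

print(dispatch("GET /index.html", [failed_server, healthy_server]))
```

The first server fails completely, yet the user sees only a slightly delayed (but correct) response; a failure is surfaced only if every server is down.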
Correctness is the degree to which a component is free of defects.
Robustness is the ability of a component, or system of components, to avoid failure in the face of faults and errors.
This is not at all the same as correctness, but rather is a complementary property. A relatively correct system may have very few defects. A robust system may have defects, be subject to faults, and experience errors ... but still avoid failure through some combination of detection, correction, repair, and redundancy.
Many engineers believe robustness to be both more useful and more achievable than correctness. This is, in part, due to the difficulty of achieving perfect correctness, but in greater part due to the fact that a robust system is more likely to survive in the face of unanticipated errors.
Reliability is the likelihood that a component or system will not fail during a specified period of time. It is typically quantified in one of three ways:
For components with a rated life, that may be the implied time period. Thus, if a light bulb with a rated 10,000-hour life had a 1.5% chance of burning out in the first 10,000 hours, we would say that its reliability was 98.5% (over its rated life).
For components that do not experience normal aging, a specific time period can be specified. My Windows 98 system had about a 50% chance of staying up for one day, but my Windows XP system seems to have a 90% chance of staying up for a week or more.
For components that are expected to fail during their lifetimes, we can observe the number of failures over a long period of operation, divide the total operating time by that number, and obtain a statistical average failure interval (a Mean Time To Failure).
My Windows 98 system had an MTTF of 4 hours. My XP system has an MTTF in excess of 150 hours.
For components that are not expected to fail during their lifetimes, we can extrapolate the probability of failure to a very large population (or time period) and look at the number of expected failures per billion hours of operation (FITs).
The soft error rate associated with cosmic ray hits to semiconductor memory is on the order of 100 FITs/megabit.
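To make FIT figures concrete, here is a small sketch; the 100 FITs/megabit rate comes from the text, while the 1 GB memory size is an assumed example:

```python
FITS_PER_MEGABIT = 100      # soft error rate cited in the text
BILLION_HOURS = 1e9         # 1 FIT = 1 failure per billion hours

def mttf_hours(megabits):
    """Mean time to a soft error for a memory of the given size."""
    failures_per_billion_hours = FITS_PER_MEGABIT * megabits
    return BILLION_HOURS / failures_per_billion_hours

# Assumed example: 1 GB of memory = 8192 megabits
print(round(mttf_hours(8192)))   # roughly 1221 hours, i.e. about 50 days
```

Note how a rate that sounds negligible per megabit translates into a soft error every few weeks once it is scaled up to a full memory system.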
Assuming standard distributions, these three quantifications should all be interchangeable (in that any could be computed from the others). In the last 10-20 years, however, availability analysts have started moving away from the most venerable of these forms (MTTF):
For components that are not expected to fail during their rated lifetimes, MTTF, while statistically meaningful, is highly misleading.
For components that are likely to experience failure during their rated lives, but can be repaired (restored to service after a failure), it is meaningful to talk about the likelihood that a component or system will be providing service at any particular instant (in steady-state operation over a long period of time). The availability of a system is a function of both its failures and its repairs.
The quantification of failures was described above (under reliability). Repair can also be quantified either as a mean time (Mean Time To Repair) or as a rate (e.g. number of repairs per billion hours).
Availability is typically quantified in one of two ways:
In most cases the Mean Time Between Failures (MTBF) is equal to the MTTF. There are, however, situations where the Mean Time to first Failure is very large, but the Mean Time to subsequent Failures is much smaller. In such cases, the steady-state MTBF would be equal to the Mean Time to subsequent Failures.
Av = expected up-time / (expected up-time + expected down-time)
which can be approximated as
Av ≈ MTBF / (MTBF + MTTR)
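Plugging illustrative numbers into this approximation (the MTBF and MTTR values below are invented for the sketch, echoing the rarely-failing-but-slow-to-repair versus often-failing-but-quick-to-recover comparison made later in the text):

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability, approximated as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# System A: fails once per 1000 hours, takes 30 minutes to repair.
# System B: fails once per 100 hours, recovers in 10 seconds.
a = availability(1000.0, 0.5)
b = availability(100.0, 10 / 3600)
print(f"A: {a:.6f}  B: {b:.6f}")   # B is more available despite failing 10x as often
```

Fast recovery can more than compensate for a higher failure rate, which is why availability analysis must consider repair time and not just MTBF.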
Availability is also (colloquially) expressed in nines:
| availability | nines | annual down time |
|---|---|---|
| 99% | 2 | 3.6 days |
| 99.9% | 3 | 8.8 hours |
| 99.99% | 4 | 52 minutes |
| 99.999% | 5 | 5 minutes |
| 99.9999% | 6 | 31 seconds |
| 99.99999% | 7 | 3 seconds* |
*in your dreams!
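The down-time column in the table follows directly from the availability figure; a quick check in Python:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_downtime_seconds(nines):
    """Expected annual down time for an availability of n nines."""
    unavailability = 10 ** -nines       # e.g. 3 nines -> 0.001 unavailability
    return SECONDS_PER_YEAR * unavailability

for n in range(2, 8):
    print(n, "nines:", round(annual_downtime_seconds(n)), "seconds/year")
```

Each additional nine cuts the expected annual down time by a factor of ten, from days at two nines to a few seconds at seven.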
Availability is affected by reliability (in that more failures imply less availability), but also incorporates the expected repair or recovery time (Mean Time To Repair). If system A fails one tenth as often as system B, but system B recovers in seconds, whereas system A takes many minutes to repair, system B could have much higher availability, despite its lower reliability.