Ram Chillarege and Nicholas S. Bowen
IBM Thomas J. Watson Research Center
Published: IEEE International Symposium on Fault Tolerant Computing, 1989.
This paper uses fault injection to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system and this paper:
- Introduces the idea of failure acceleration to conduct such experiments.
- Estimates total loss of the primary service to occur in only 16% of the faults.
- Reveals errors, termed potential hazards, that do not affect short-term availability but cause catastrophic failure following a change in operating state.
- Identifies at least 41% of errors as potential candidates for repair before total failure. Estimates are provided on the likely location of such errors in terms of storage area and code-data split.
These results enhance our understanding of large system failures and provide a foundation for design enhancements and modeling of availability.
An important attribute of a fault injection experiment is the ability to measure in the lab what could not be measured in the field. This covers many aspects of the failure process, including, when possible, cause-and-effect analysis. Interestingly, there seems to be a mechanism that allows for a number of measurements if a few conditions are met, or if the system can be stressed towards meeting these conditions. These conditions result in accelerating the failure process. We first define failure acceleration formally and then discuss its attributes.
Excerpt from Section II. DESIGN OF THE EXPERIMENT
Definition - Failure Acceleration:
The failure process is said to be accelerated when the fault model is not altered and:
- 1. The fault latency is decreased.
- 2. The error latency is decreased.
- 3. The probability of a fault causing a failure is increased.
Definition - Maximum Failure Acceleration:
The failure process is said to be maximally accelerated when the fault model is not altered and:
- 1. The fault latency is zero.
- 2. The error latency is a minimum.
- 3. The probability of a fault causing a failure is maximized.
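The two definitions above can be read as predicates over measured quantities. The following Python fragment is a minimal sketch of that reading, not code from the paper. The `FailureProfile` class, its field names, and both function names are hypothetical. It assumes that all three numbered conditions must hold together, and that "a minimum" and "maximized" are judged against a finite set of observed setups sharing the same fault model.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FailureProfile:
    """Hypothetical summary of one experimental setup (illustrative names)."""
    fault_model: str      # label identifying the fault model in use
    fault_latency: float  # fault latency, in any consistent time unit
    error_latency: float  # error latency, in the same time unit
    p_failure: float      # probability that a fault causes a failure


def is_accelerated(baseline: FailureProfile, candidate: FailureProfile) -> bool:
    """Failure acceleration: same fault model, both latencies decreased,
    and probability of failure increased (all three read as required)."""
    return (
        candidate.fault_model == baseline.fault_model
        and candidate.fault_latency < baseline.fault_latency
        and candidate.error_latency < baseline.error_latency
        and candidate.p_failure > baseline.p_failure
    )


def is_maximally_accelerated(
    candidate: FailureProfile, observed: Iterable[FailureProfile]
) -> bool:
    """Maximum failure acceleration, judged against the observed setups that
    share the candidate's fault model (an assumption of this sketch)."""
    peers = [p for p in observed if p.fault_model == candidate.fault_model]
    if not peers:
        return False  # nothing to compare against under this fault model
    return (
        candidate.fault_latency == 0
        and candidate.error_latency <= min(p.error_latency for p in peers)
        and candidate.p_failure >= max(p.p_failure for p in peers)
    )
```

For instance, with illustrative (not measured) values, `is_accelerated(FailureProfile("m", 10.0, 5.0, 0.2), FailureProfile("m", 0.0, 1.0, 0.6))` holds, while comparing profiles with different `fault_model` labels never does, which mirrors the requirement that the fault model is not altered.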