Articles


Technical Topics

Understanding Large System Failures - A Fault Injection Experiment

Ram Chillarege and Nicholas S. Bowen
IBM Thomas J. Watson Research Center

Published: IEEE International Symposium on Fault Tolerant Computing, 1989.

Abstract: --
This paper uses fault injection to characterize large system failures. Thus, it overcomes limitations imposed by the lack of complete information in field failure data. The experiment is conducted on a commercial transaction processing system and this paper:

  • Introduces the idea of failure acceleration to conduct such experiments.
  • Estimates total loss of the primary service to occur in only 16% of the faults.
  • Reveals errors, termed potential hazards, that do not affect short-term availability but cause catastrophic failure following a change in operating state.
  • Identifies at least 41% of errors as potential candidates for repair before total failure. Estimates are provided on the likely location of such errors in terms of storage area and code-data split.

These results enhance our understanding of large system failures and provide a foundation for design enhancements and for modeling of availability.

Measurement of Failure Rate in Widely Distributed Software

Ram Chillarege, Shriram Biyani, and Jeanette Rosenthal
IBM Thomas J. Watson Research Center

Abstract: --
In the history of empirical failure rate measurement, one problem that continues to plague researchers and practitioners is that of measuring the customer-perceived failure rate of commercial software. Unfortunately, even order-of-magnitude measures of failure rate are not truly available for widely distributed commercial software. Despite repeated reports on the criticality and significance of software, the industry flounders for real baselines.

  • This paper reports the failure rate of a commercial software product with several million lines of code, distributed to hundreds of thousands of customers. To a first-order approximation, the MTBF plateaus at around 4 years and 2 years for successive releases of the software. The changes in the failure rate as a function of severity, release and time are also provided.
  • The measurement technique develops a direct link between failures and faults, providing an opportunity to study and describe the failure process. Two metrics are defined and characterized: the fault weight, corresponding to the number of failures due to a fault, and the failure window, measuring the length of time between the first and last failure due to a fault.
  • The two metrics are found to be higher for higher severity faults, consistently across all severities and releases. At the same time, the window-to-weight ratio is invariant by severity. The fault weight and failure window are natural, intuitive measures of the failure process. The fault weight measures the impact of a fault on the overall failure rate, and the failure window measures the dispersion of that impact over time. Together they provide a new forum for discussion and an opportunity to gain greater understanding of the processes involved.
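
As a rough illustration of the two metrics, the sketch below computes a fault weight and a failure window from a list of failure reports. It is not the paper's measurement procedure: the record layout (a fault identifier paired with a report date), the function name fault_metrics, the sample data, and the reading of the failure window as the number of days between the first and last failure attributed to a fault are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import date


def fault_metrics(failures):
    """Return {fault_id: (weight, window_days)} for (fault_id, date) pairs.

    weight      = number of failures attributed to the fault
    window_days = days between the first and last such failure (assumed definition)
    """
    dates_by_fault = defaultdict(list)
    for fault_id, reported_on in failures:
        dates_by_fault[fault_id].append(reported_on)

    metrics = {}
    for fault_id, dates in dates_by_fault.items():
        weight = len(dates)
        window_days = (max(dates) - min(dates)).days
        metrics[fault_id] = (weight, window_days)
    return metrics


# Hypothetical failure reports, not data from the paper.
failures = [
    ("F1", date(1990, 1, 10)),
    ("F1", date(1990, 3, 11)),
    ("F1", date(1990, 2, 1)),
    ("F2", date(1990, 5, 5)),
]

result = fault_metrics(failures)
for fault_id, (weight, window_days) in sorted(result.items()):
    print(fault_id, "weight:", weight, "window (days):", window_days)
```

A fault reported only once has a weight of 1 and a window of 0 days under this reading, so a window-to-weight ratio computed from such records is only informative for faults that recur.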

Discovering Relationships between Service and Customer Satisfaction

Michael Buckley and Ram Chillarege
IBM Thomas J. Watson Research Center, 1995

Abstract: --

Organizations spend significant resources tracking customer satisfaction and managing service delivery. Although a great deal of effort is expended in understanding what goes on within each of these areas, little or no effort has been applied to identifying and quantifying the relationships between the two. The objective of this research is to discover and establish potential relationships between service data and customer satisfaction. This understanding will enable more effective management, which will lead to improved quality, reduced cost and increased customer satisfaction.

Software Defects and their Impact on System Availability - A Study of Field Failures in Operating Systems

Mark Sullivan and Ram Chillarege
IBM Thomas J. Watson Research Center

Abstract: --

In recent years, software defects have become the dominant cause of customer outage, and improvements in software reliability and quality have not kept pace with those of hardware. Yet, software defects are not well enough understood to provide a clear methodology for avoiding or recovering from them. To gain the necessary insight, we study defects reported between 1986 and 1989 on a high-end operating system product. We compare a typical defect (regular) to one that corrupts a program’s memory (overlay), given that overlays are considered by field services to be particularly hard to find and fix.

Generation of an Error Set that Emulates Software Faults based on Field Data

J. Christmansson and Ram Chillarege
Chalmers University of Technology, IBM Research, 1996


Abstract: --
A significant issue in fault injection experiments is that the injected faults are representative of software faults observed in the field. Another important issue is the time used, as we want experiments to be conducted without excessive time spent waiting for the consequences of a fault. An approach to accelerate the failure process would be to inject errors instead of faults, but this would require a mapping between representative software faults and injectable errors. Furthermore, it must be assured that the injected errors emulate software faults and not hardware faults. These issues were addressed in a study of software faults encountered in one release of a large IBM operating system product. The key results are:
  • A general procedure that uses field data to generate a set of injectable errors, in which each error is defined
    by: error type, error location and injection condition. The procedure assures that the injected errors emulate
    software faults and not hardware faults.
  • The faults are uniformly distributed (1.37 faults per module) over the affected modules.