ODC - Orthogonal Defect Classification


Summary

This paper addresses a critical need in fault-tolerant computing today. As commercial software is distributed to hundreds of thousands of customers, and studies point to software as a significant contributor to outage and failure, measuring the actual failure rate perceived by an end user becomes critical. This one parameter has a significant influence on the overall design for dependability. It has been an elusive parameter to measure: several indirect methods have been proposed but, despite the attention, little data is available. Experts struggle to give even an order-of-magnitude estimate of the failure rate of such commercial software.

This paper addresses the above need and provides quantitative results. The study is conducted on two releases of a software product (with several million lines of code) distributed to a large customer base. The key difference between this study and several others is that we use failures reported into the service process, from a worldwide customer base, to measure failure rate directly. This is not an easy task. To isolate just those software failures belonging to a product, we tie each reported failure to the fault that caused it, so that it can be traced to the specific product. This technique yields, to the best of our knowledge, some of the finest measurements for a commercial software product. It also allows us to provide very detailed quantified results. Among the results reported are:

  1. The failure rates of Release 1 and Release 2 of this product plateau at about 0.02 and 0.04 failures per machine-month, respectively. The plateaus occur around 3 years and 18 months after the respective release dates. To a first-order approximation, they correspond to MTBFs of 4 years and 2 years. We also present the change in failure rate as a function of time, release, and severity. These order-of-magnitude measures should be eye-openers for the industry.
  2. The fault weight (the number of failures due to a fault) tends to be higher for the higher severity faults than for the lower severity faults. This may sound intuitive and obvious, except that the assignment of severity is an entirely qualitative judgment based on the customer's feelings and the service representative's opinion. A similar trend is noted for the failure window (the length of time between the first and last occurrence of failures for a particular fault): the windows are larger for the higher severity faults. These trends are systematic across all severities and releases.
  3. The failure window and fault weight provide natural and intuitive measurements of the failure process. The fault weight measures the impact of a fault on the overall failure rate. The failure window, on the other hand, measures the dispersion of that impact over time. These metrics provide a new forum for discussion and an opportunity to better model, understand, and possibly control the failure process across a customer base.
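
The quantities in the list above can be illustrated with a minimal sketch. It assumes, as the first-order approximation does, a constant failure rate, so that the MTBF is the reciprocal of the rate. It also assumes that each failure report has already been tied to a fault identifier. The record layout, the function names, and the sample reports are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict
from datetime import date


def mtbf_years(failures_per_machine_month):
    """First-order MTBF in years: reciprocal of a constant failure rate."""
    return 1.0 / failures_per_machine_month / 12.0


def fault_metrics(failure_reports):
    """Map each fault id to (fault weight, failure window in days).

    failure_reports: iterable of (fault_id, report_date) pairs.
    Fault weight is the number of failures attributed to the fault.
    Failure window is the time between its first and last failure.
    """
    dates_by_fault = defaultdict(list)
    for fault_id, reported_on in failure_reports:
        dates_by_fault[fault_id].append(reported_on)
    return {
        fault_id: (len(dates), (max(dates) - min(dates)).days)
        for fault_id, dates in dates_by_fault.items()
    }


# Hypothetical failure reports, invented purely for illustration.
reports = [
    ("F1", date(1995, 1, 10)),
    ("F1", date(1995, 3, 1)),
    ("F1", date(1995, 7, 19)),
    ("F2", date(1995, 2, 5)),
]
metrics = fault_metrics(reports)
```

With the plateau rates quoted above, `mtbf_years(0.02)` gives about 4.2 years and `mtbf_years(0.04)` about 2.1 years, consistent with the rounded figures of 4 and 2 years. In the sample data, fault `F1` has a weight of 3 and a window of 190 days, while `F2`, seen only once, has a weight of 1 and a window of 0 days.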

This paper should be of interest to anyone working on the fault tolerance of systems that contain software.




rchill
Wed Mar 31 12:29:44 EST 1999