ODC - Orthogonal Defect Classification


Results

Once we have the failure events per month, the license base, and the under-reporting factors, we can compute the failure rate on a per machine-month basis. Figure 4 shows the overall reported failure rates of two releases of the software, superimposed on each other and unadjusted for under-reporting. The time axis for each release has been positioned to reflect months after release into the field. The ordinate is in units of failures per machine per month. For Release 1, the graph shows data only from around month 10 onward, because earlier data was unavailable in the database. It may be safe to presume that the failure rate in those earlier months was higher.
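As a minimal sketch of this computation, the snippet below divides monthly failure counts by the license base and optionally corrects for under-reporting. The exact form of the under-reporting factor is not given in this section, so the sketch assumes it can be expressed as the fraction of failures that actually get reported; the function name, parameter names, and all numbers are hypothetical.

```python
def failure_rate(failures, license_base, reported_fraction=1.0):
    """Return failures per machine per month.

    ASSUMPTION: the under-reporting factor is modelled here as the
    fraction of failures that are reported (0 < fraction <= 1).
    A value of 1.0 gives the unadjusted rate.
    """
    if license_base <= 0:
        raise ValueError("license_base must be positive")
    if not 0 < reported_fraction <= 1:
        raise ValueError("reported_fraction must be in (0, 1]")
    return failures / (license_base * reported_fraction)


# Hypothetical (failure count, licensed machines) pairs for three months.
monthly = [(120, 4000), (150, 6000), (140, 8000)]

unadjusted = [failure_rate(f, n) for f, n in monthly]
adjusted = [failure_rate(f, n, reported_fraction=0.5) for f, n in monthly]

for month, (u, a) in enumerate(zip(unadjusted, adjusted), start=1):
    print(f"month {month}: unadjusted={u:.4f}, adjusted={a:.4f}")
```

Under this assumption the adjusted curve is simply a scaled copy of the unadjusted one whenever the reported fraction is constant over time.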

Figure 4: Software Failure Rates of Release 1 and 2

We observe that, about three months after release, the failure rate for Release 2, adjusted for under-reporting, is around 0.2 failures per machine per month, while Release 1 reached the same level after about 10 months in the field. The failure rate for Release 2 came down rapidly during its first year in the field and has been decreasing more slowly since. The failure rate for Release 1, on the other hand, started from a presumably higher level and has continued to drop rapidly over a longer period. About 21 months after the general availability date, the failure rate for Release 1 is lower than that of Release 2 after the same duration in the field. The continued rapid drop for Release 1 may be partly due to less support for the older release, following the availability of the new release. It may also be presumed that the users remaining with Release 1 are the ones experiencing fewer major problems.

Once they plateau, the failure rates, adjusted for under-reporting, are around 0.02 per machine per month for Release 1 and around 0.04 for Release 2, corresponding to mean times between failures (MTBF) of about 4 years and 2 years, respectively. These numbers provide one of the first order-of-magnitude estimates of perceived failure rates for widely used operating systems. The high MTBF for Release 1 was attained only after it had been in the field for over 3 years.
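The conversion from a plateau failure rate to a mean time between failures can be checked with a short calculation. The sketch below takes the MTBF as the reciprocal of the rate, which assumes a roughly constant failure rate over the plateau; the function name is illustrative only.

```python
def mtbf_years(rate_per_machine_month):
    """Mean time between failures in years, assuming a constant rate."""
    if rate_per_machine_month <= 0:
        raise ValueError("rate must be positive")
    months = 1.0 / rate_per_machine_month  # mean months between failures
    return months / 12.0


# Plateau rates quoted in the text (failures per machine per month).
for name, rate in (("Release 1", 0.02), ("Release 2", 0.04)):
    print(f"{name}: {mtbf_years(rate):.1f} years")
```

A rate of 0.02 gives 50 months and a rate of 0.04 gives 25 months, consistent with the rounded figures of about 4 and 2 years.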

Figure 5: Failure Rate by Severity - Release 1

Figure 6: Failure Rate by Severity - Release 2

The failure rate shown in Figure 4 combines failures of different severity classes. At IBM, each failure (and the underlying fault) is categorized as severity 1, 2, 3, or 4, in decreasing order of severity. Severity 1 usually implies a failure that disables every application on the operating system. Severity 2 implies a serious, but not catastrophic, disruption of work. Severities 3 and 4 are much less severe failures that would usually be considered annoyances. Figures 5 and 6 divide the overall failure rates of Releases 1 and 2 (unadjusted for under-reporting) into the contributions of the different severity classes. It is evident that the severity 1 failure rates are much lower than those of severities 2 and 3. Failures of all severities follow roughly the same decreasing trend over time.
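The split of an overall rate into per-severity contributions can be illustrated as follows. This is only a sketch with made-up data: the list of severity codes, the license base, and the function name are all hypothetical, and the counts are chosen merely to resemble the qualitative pattern described above.

```python
from collections import Counter


def rates_by_severity(severities, license_base):
    """Failures per machine per month for each severity class 1 to 4."""
    if license_base <= 0:
        raise ValueError("license_base must be positive")
    counts = Counter(severities)
    return {sev: counts.get(sev, 0) / license_base for sev in (1, 2, 3, 4)}


# Hypothetical severity codes of the failures reported in one month.
reports = [1] * 5 + [2] * 40 + [3] * 45 + [4] * 10

by_sev = rates_by_severity(reports, license_base=5000)
overall = sum(by_sev.values())  # the per-severity rates add up to the total

for sev, rate in by_sev.items():
    print(f"severity {sev}: {rate:.4f}")
print(f"overall: {overall:.4f}")
```

Because every failure belongs to exactly one severity class, the per-severity rates sum to the overall rate for the same month.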





rchill
Wed Mar 31 12:29:44 EST 1999