ODC - Orthogonal Defect Classification
Failure/Fault Relationship

One might wonder why we bother with faults when we are interested in the failure rate. The issue is one of narrowing all reported failures of a product down to just those that are due to the product. When a defect-oriented problem call comes in, it is routed to a team that knows the family of products. However, it is not immediately evident which of the numerous software products caused the failure. The failure may be manifested in one product while its cause lies a layer below in another product (e.g., the operating system), or in a peer middleware product with which the incident product shares resources. Therefore, before we start counting failures to measure failure rates, we need to identify the subset of failures that pertain to the product under study. Establishing the link with a fault in the product is an unequivocal way to make certain that the failure is related to the specific product in question, because the fault record (APAR) ultimately identifies the line(s) of code that cause the failure.

Establishing the link between fault and failure is a simple concept in theory but a complicated task in practice. Several issues need to be grappled with, and we were fortunate to be able to do so with the available data. It required not only a clear understanding of the content of disparate databases, but also a detailed set of interviews with service personnel. Essentially, we needed to understand their practices of recording data and the idiosyncrasies of the Problem and APAR tracking systems.

To illustrate a few of the challenges:

Failures and faults are tracked separately and handled by different segments of the service process. We found that data about a fault did not generally exist in the failure record initially; it was present, however, once the problem was resolved. Where was it, and was it consistently retrievable?
We found that the fault was tagged in the failure record for the product. This was a significant step forward, because it meant that we would be able to extract reasonably accurate data.

Another issue, one that is not apparent to the casual observer, is the sheer volume of failure data that needs to be processed. Failure data involves a significant amount of dialog back and forth between the customer and the service person, and is therefore voluminous. Given its size, most of it is not saved and is deleted periodically. A subset of the data, with key fields, is selectively archived. This data, from a family of products across worldwide service, needs to be processed against the fault data to identify the failures due to the product.

Once we extract the failures using the faults in the product, we have the necessary event stream to compute failure rates. Given the event sequence of failures and faults across the customer base, we have an opportunity to learn more about this process. To describe and understand the fault and failure processes, we define two metrics: failure window and fault weight.

The failure window is the length of time between the first and last known occurrence of a failure from a specific fault. In Figure 2, the periods of time W1 and W2 are the failure windows for faults A1 and A2, respectively. Essentially, a fault has a nonzero failure window if its failure ever recurred in the customer base. The last occurrence clearly depends on the observation period and, like all real failure data, is right-censored.

The fault weight is the number of failures resulting from a fault. In other words, it is the number of failures observed during the fault's failure window. In this example, fault A1 has a weight of 3 and fault A2 has a weight of 2. Extending the observation interval could result in a longer failure window and a larger fault weight for a given fault.
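The two definitions above can be made concrete with a minimal Python sketch. It assumes the failure stream has already been reduced to (fault identifier, occurrence date) pairs for the product under study; the function name `window_and_weight`, the record layout, and the dates are invented for illustration and are not taken from the service databases. Only the weights of 3 and 2 for faults A1 and A2 follow the example in the text.

```python
from collections import defaultdict
from datetime import date


def window_and_weight(failures):
    """Compute fault weight and failure window per fault.

    `failures` is an iterable of (fault_id, occurrence_date) pairs,
    assumed to be already restricted to failures linked to a fault
    in the product under study (a hypothetical record layout).
    """
    dates_by_fault = defaultdict(list)
    for fault_id, occurred_on in failures:
        dates_by_fault[fault_id].append(occurred_on)

    stats = {}
    for fault_id, dates in dates_by_fault.items():
        stats[fault_id] = {
            # Fault weight: number of failures attributed to the fault.
            "weight": len(dates),
            # Failure window: time from first to last known occurrence.
            # Right-censored: it can only grow if observation continues.
            "window_days": (max(dates) - min(dates)).days,
        }
    return stats


# Invented dates; only the weights (3 and 2) mirror the example above.
events = [
    ("A1", date(1995, 1, 10)),
    ("A2", date(1995, 2, 1)),
    ("A1", date(1995, 3, 5)),
    ("A2", date(1995, 4, 20)),
    ("A1", date(1995, 6, 15)),
]
stats = window_and_weight(events)
print(stats)
```

A fault seen only once gets a window of zero days in this sketch, which is consistent with the remark that the window is nonzero only when the failure recurs.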
The failure window should not be confused with the grouping of error events occurring within a short interval of time into tuples [TS83, LS90, IYS86]. Tuples are used primarily to group a burst of errors into one logical failure event. This coalescing is done on errors occurring on the same machine, since they are most likely related to each other. The failure window is a different measure from such coalescence time periods: it is defined on failures occurring on different systems, and possibly different customer accounts, over prolonged periods of time. The window is defined on failures appearing across the entire install base, whereas the previous efforts focused on multiple failures reported on a single system. Usually, when a problem is reported via the service process, a person has already coalesced any multiple error conditions into one logical failure. Thus the failures we analyze are already coalesced.

The two measures, fault weight and failure window, are natural to the fault and failure process. The purpose of describing them is not to measure the failure rate from observed data; the failure rate can be estimated without them. However, these metrics provide the intuition to understand the failure process better. We will, therefore, provide statistics of fault weight and failure window. The fault weight indicates how frequent or rare a failure resulting from a given fault is. The failure window suggests the length of exposure to a particular fault and provides a measure of the dispersion of the failures resulting from it. Together, they provide significant insight into the composition of the overall failure rate.

One advantage of the weight and window metrics is that they open the possibility of constructing better reliability models. In this paper we do not pursue that, but we lay the foundation by providing the statistics of these metrics.
[Ada84] is an example of such an effort, but it does not address the concept of a failure window. It does divide faults into different classes based on failure rates, thereby indirectly using a weight-like metric.
An important ingredient in computing the failure rate of software is the install base in the field. Fortunately, this data is usually well tracked, given that it has to do with product revenue. The computation of the install base considers existing licences, new licences, and migration from old releases to new releases. Figure 3 shows how these curves may look. Usually, new computers are sold with the latest release of the software, and there is also a migration from the earlier release to the new one. This data is used to normalize the failure rate to a per-machine basis.

Another factor that needs to be compensated for is the under-reporting of problems to the service center. Under-reporting is a function of the way customers deal with problems and of the services and system management associated with their systems. The number of problems reported to the service process is only a fraction of what occurs in the field. Through customer surveys and interviews with people knowledgeable in service, we believe that under-reporting is between 80 and 90 percent, meaning that only 10 to 20 percent of field failures are reported. This translates to an actual failure rate of about 5 to 10 times what is reported. For this paper, we use a factor of 10 to scale up the reported failure rate to arrive at the adjusted failure rate.

Since we have only a rough estimate of the level of under-reporting, we can only assert that our scaled failure rate is within an order of magnitude of the actual failure rate. We do not attempt to assign a confidence level to this claim, since there is no objective basis for doing so. Given how little is known about real failure rates for commercial software, obtaining an order-of-magnitude estimate is, we believe, a significant achievement.
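The arithmetic behind the scaling can be sketched in a few lines of Python. If a fraction u of field failures goes unreported, the reported count must be multiplied by 1 / (1 - u), which gives roughly 5 for u = 0.8 and roughly 10 for u = 0.9. The function names, the use of machine-months as the exposure unit, and the sample counts below are assumptions made for illustration, not figures from the study.

```python
def scaling_factor(under_reporting):
    """Multiplier from reported to estimated actual failures.

    `under_reporting` is the fraction of field failures that never
    reach the service center (0.8 to 0.9 in the text).
    """
    if not 0.0 <= under_reporting < 1.0:
        raise ValueError("under_reporting must be in [0, 1)")
    return 1.0 / (1.0 - under_reporting)


def adjusted_failure_rate(reported_failures, machine_months, factor=10.0):
    """Failures per machine per month, scaled for under-reporting.

    `machine_months` is the install base integrated over the
    observation period (an assumed exposure unit). The default
    factor of 10 is the value the text adopts.
    """
    if machine_months <= 0:
        raise ValueError("machine_months must be positive")
    return factor * reported_failures / machine_months


low = scaling_factor(0.8)   # approximately 5
high = scaling_factor(0.9)  # approximately 10

# Hypothetical numbers: 12 reported failures over 6000 machine-months.
rate = adjusted_failure_rate(reported_failures=12, machine_months=6000)
print(low, high, rate)
```

Because the factor is itself only known to lie somewhere between about 5 and 10, the result of such a calculation is best read as an order-of-magnitude figure, as the text states.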
rchill Wed Mar 31 12:29:44 EST 1999