ODC - Orthogonal Defect Classification
Introduction

One of the key aspects of failure generation is the activation process by which a software fault causes an error that then results in a failure. This has broadly been attributed to usage patterns or environmental conditions that are prevalent under different circumstances. However, very little work has taken the activation process apart to provide a deeper understanding of the mechanisms that cause faults to surface. The seemingly random process of failure occurrence is relatively well understood for hardware and technology, where faults have comparatively short latencies. In contrast, software fault latencies are quite large and much less understood. Thus, the triggering mechanisms that terminate the fault latency hold the key to a large part of this understanding.

This paper explores this opportunity and provides some understanding of the triggering process. Specifically, it provides the distributions of the triggers that activated faults and resulted in failures. To do so, we studied the faults from an operating system product over a period of two years after its release into the field. In addition to presenting the distributions themselves, we study how they change as a function of time. This change over time is reported here for the first time and illustrates the dynamics of the customer environment.

The value of this study is that it brings into active discussion what we believe is one of the key aspects of the failure generation process, namely triggers. Triggers influence several issues, some academic and some practical. They capture the mix of the customer environment and measure the aggregate operational profile [MIO87]. Thus, they make excellent input for the estimation and projection of software reliability. On the other hand, triggers quantify one of the critical factors that vary the acceleration in a fault injection experiment.
For the successful development of fault injection methodologies [CB89], [Ae90], one needs to understand what makes faults generate errors and result in failures. Triggers are the factors that make this happen. Thus, understanding and quantifying triggers and their characteristics provides guidance for research and modelling and, importantly, for the design of fault injection experiments.

To conduct this study, we chose an operating system product with several million lines of code. The product enjoys a large customer base, with significant install growth within a short period after becoming available. It is a system-wide product exploited in a large environment. Customers have full support for all types of software problems, with software fixes dispatched for all known faults. The product is renowned for its dependability. Data from field failures spanning a two-year period is used to identify triggers and drive the analysis.

To help communicate the results of this paper, we begin with a brief discussion of what triggers are. Although triggers were introduced in our earlier work, the concept is new, and the paper benefits from some explanation and clarification. This is followed by a set of definitions of triggers, which are not currently available in our publications. The data from this study is then presented and discussed.

We believe that this article should be of interest to a fairly wide audience in the dependable computing industry. Specifically, it should interest the fault injection community, given that the trigger is a key parameter controlling the acceleration of failure in a system. In addition, the activation mechanism is particularly relevant to deciding when and where to inject faults. This is usually a difficult question in software, given that the sample space is large in both time and space.
Since the trigger is the catalyst for failure, it provides additional guidance on prescribing the environmental conditions under which faults need to be injected to emulate real-world conditions. The work should also be of interest to reliability modelling and prediction, since the trigger is one of the dominant parameters that can be related to the probability with which faults turn into failures in the field. Finally, an understanding of the catalyst motivates the designers of fault tolerance to conceive methods to detect and subvert failure in the field. Thus, we believe that this paper and its results have far-reaching relevance to the dependable computing community at large.
rchill Mon Mar 29 18:54:02 EST 1999