How Reproducible Should Fault Injection Experiments Be?

Andreas Steininger 1
Vienna University of Technology, Department of Measurement Technology
A-1040 Vienna, Austria

Fault injection is the classical complement to formal methods in the area of fault tolerance assessment. It is applied to unveil fault tolerance deficiencies with the aims of (a) optimizing or verifying the fault tolerance mechanisms during the design phase and (b) predicting the fault tolerance behavior of an existing system in its application.

Reproducibility is a vital property of a fault injection experiment (as of any other scientific experiment). Since the common fault-tolerance ratings are derived by means of statistics, two levels of reproducibility exist. First, repeating the whole experiment series must yield almost identical statistical ratings; this is necessary evidence that the fault/activity space is meaningfully defined and adequately covered in the statistical sense. Second, performing a single fault injection twice may produce the identical result in detail, if setup and injection method are carefully chosen. This higher degree of reproducibility is not a necessary condition for the usefulness of the results and is therefore often disregarded in practical setups. The benefits of such detailed reproducibility, however, are impressive:

  • In a typical experimental series a huge number of faults are injected to achieve a high coverage of the fault/activity space. The observation of the system reaction must often be limited to fail/no-fail information, as recording detailed behavioral information such as control flow, bus timing etc. is ruled out by memory limitations. Of these numerous injections (hopefully) only a few will actually unveil fault tolerance deficiencies. The fault-tolerance optimization process is substantially simplified if such single alarming results can be reproduced and more comprehensive data records can be collected to support a detailed analysis.
  • The same is true for doubtful single results: Experimental readouts that are surprising in some respect can be verified or possibly traced to setup deficiencies.
  • A direct comparison of different fault-tolerance strategies can be performed with identical fault sets. Statistical uncertainties do not apply.
  • Similarly, the contribution of a single error detection mechanism (EDM) to overall coverage can be directly assessed by comparison of a setup with and without the EDM. This is normally a problem, since a dominant EDM may mask subsequent activation of others.
  • The effectiveness of optimization measures that have been taken to eliminate fault tolerance deficiencies becomes directly observable.
  • Exactly reproducible system reaction to a given fault set provides an ideal foundation for certification and conformance testing.
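
The comparison of setups with and without a given EDM on an identical fault set can be illustrated with a minimal sketch of software-implemented fault injection. Everything in it is an assumption made for illustration only: the toy workload (a checksum over 16 memory words), the single bit-flip fault model, the parity check standing in for an EDM, and the names Fault and run.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    step: int     # workload step at which the bit flip is injected
    address: int  # index of the affected memory word
    bit: int      # bit position within the 8-bit word


def run(fault, use_parity_edm):
    """Run the toy workload with one injected bit flip and classify the outcome."""
    memory = [(17 * i + 3) % 256 for i in range(16)]   # fixed initial state
    parity = [bin(w).count("1") % 2 for w in memory]   # parity stored per word
    golden = sum(memory) % 256                         # fault-free checksum
    total = 0
    for step in range(16):
        if step == fault.step:
            memory[fault.address] ^= 1 << fault.bit    # the injection itself
        word = memory[step]
        if use_parity_edm and bin(word).count("1") % 2 != parity[step]:
            return "detected"
        total = (total + word) % 256
    return "correct" if total == golden else "silent failure"


# One fixed fault set (seeded generator), replayed against both configurations.
rng = random.Random(1)
fault_set = [Fault(rng.randrange(16), rng.randrange(16), rng.randrange(8))
             for _ in range(200)]
without_edm = [run(f, use_parity_edm=False) for f in fault_set]
with_edm = [run(f, use_parity_edm=True) for f in fault_set]

# Faults that escape without the EDM but are caught with it.
gained = sum(a == "silent failure" and b == "detected"
             for a, b in zip(without_edm, with_edm))
print(f"{gained} of {len(fault_set)} faults are covered only by the parity EDM")
```

Because both runs see exactly the same faults, the difference between the two outcome lists is attributable to the EDM alone, and any single alarming injection can be replayed in isolation with more detailed logging.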

Detailed reproducibility implies exact controllability of the fault injection, which has several further benefits:

  • It is possible to directly provoke a worst-case scenario.
  • A high degree of fault activation can be achieved, simply by avoiding ineffective injections.
  • Detailed knowledge of the characteristics of the injected fault provides an ideal support for stratified sampling techniques. We have clear indications of whether we have sufficiently varied all relevant parameters.
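
How exact knowledge of the fault parameters supports stratified sampling and the avoidance of ineffective injections can be sketched as follows. The fault space (injection step, memory address, bit position), the activation rule and the choice of strata are hypothetical and serve only to illustrate the idea.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, stratum_of, per_stratum, seed=0):
    """Draw up to `per_stratum` faults from every stratum, reproducibly."""
    rng = random.Random(seed)            # fixed seed: the sample can be regenerated
    strata = defaultdict(list)
    for fault in candidates:
        strata[stratum_of(fault)].append(fault)
    sample = []
    for key in sorted(strata):           # fixed iteration order over the strata
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical fault space: (injection step, memory address, bit position).
fault_space = [(s, a, b) for s in range(16) for a in range(16) for b in range(8)]

# Assumed activation rule: word `a` is consumed at step `a`, so a flip injected
# after that step (a < s) can never be activated and is skipped.
effective = [f for f in fault_space if f[1] >= f[0]]

# Strata: quarter of the run in which the fault is injected, and bit position.
sample = stratified_sample(effective,
                           stratum_of=lambda f: (f[0] // 4, f[2]),
                           per_stratum=3,
                           seed=42)
print(len(sample), "faults selected from", len(effective), "effective candidates")
```

Since every stratum is represented by construction, the sample itself documents which parameter combinations have been varied.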

Although these benefits sound tempting, detailed reproducibility is hard to achieve and imposes many restrictions on the experimental setup. Generally, three conditions must be fulfilled:

  1. The injected fault must be reproducible (including its manifestation on the system).
  2. The timing relation between injected fault and target system must be controllable.
  3. The behavior of the target system must be reproducible.

The timing resolution under which these conditions must be met is determined by the system clock period and the respective setup/hold windows. Let us now investigate the implications of these conditions:

Condition (1) - reproducibility of the fault
While it is generally not a problem to exactly reproduce a fault in a simulation environment or by software-implemented fault injection, reproducibility is extremely hard to attain for physical fault injection: The statistical nature of heavy ion radiation and the high sensitivity of the effects of EMI to the slightest variations in topology or environmental conditions (temperature, humidity) are prominent examples of the lack of controllability of physical fault injection. The situation is equally bad for disturbances of the power supply, where current peaks, parasitic inductance, bypass capacitors, varying thresholds etc. introduce temporal uncertainties far beyond one processor clock period. Methods like pin forcing or insertion facilitate better controllability, but maintaining a precise fault pulse with constant shape in a 50 MHz system is still a challenge.
=> In summary, the demand for reproducibility severely restricts the choice of applicable fault injection techniques, particularly with respect to physical-level fault injection.

Condition (2) - controllable timing relation
The necessary timing relation between system operation and fault occurrence can be established by synchronizing the fault injection (a) to system operation by means of triggering and (b) to system clock. Additionally, any temporal jitter of the fault injection must be minimized.
=> As above, this condition is quite easily fulfilled with simulation-based or software-implemented fault injection, while hardware-based methods are difficult or even impossible to apply.
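
For the simulation-based case, the combination of triggering and clock synchronization can be sketched as below. The model is purely hypothetical: a 16-instruction loop stands in for system operation, the trigger is the program counter reaching a given value, the fault fires a fixed number of clock cycles later, and an optional jitter term models an imperfect injector.

```python
import random


def simulate(trigger_pc, delay_cycles, jitter_cycles=0, seed=None):
    """Toy cycle-level model of a triggered, clock-synchronous injection.

    Returns the cycle in which the fault was injected and the final
    value of the accumulator register.
    """
    rng = random.Random(seed)
    acc = 1
    fire_at = None        # cycle at which the armed injector will fire
    inject_cycle = None   # cycle at which the fault actually occurred
    for cycle in range(64):
        pc = cycle % 16   # toy program counter: a 16-instruction loop
        if fire_at is None and pc == trigger_pc:
            # (a) trigger on system operation, (b) count whole clock cycles
            fire_at = cycle + delay_cycles + rng.randint(0, jitter_cycles)
        if inject_cycle is None and cycle == fire_at:
            acc ^= 0x10   # injected fault: flip one bit of the accumulator
            inject_cycle = cycle
        acc = (acc * 5 + pc) % 251
    return inject_cycle, acc


# Without jitter, two runs agree in every detail.
print(simulate(trigger_pc=5, delay_cycles=3) == simulate(trigger_pc=5, delay_cycles=3))

# With up to 4 cycles of jitter, the injection instant depends on the injector.
print(sorted({simulate(5, 3, jitter_cycles=4, seed=s)[0] for s in range(20)}))
```

With zero jitter the injection instant is fully determined by trigger and delay; any nonzero jitter moves it across clock cycles, and detailed reproducibility is lost.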

Condition (3) - reproducible behavior of the target system

(3a) Hardware: Although hardware operation appears strictly deterministic from a macroscopic point of view, there are many sources of uncertainty on a microscopic (clock related) level, such as interrupts, asynchronous communication, hard disk access, random time-out after bus access conflicts, caching or DRAM refresh. Even second-order effects like clock jitter or variations of threshold with temperature and power supply may impede reproducibility.
=> Reproducibility of the hardware operation can be maintained with an extremely careful setup. However, some restrictions on the design and on the operational modes during the experiments must be made.

(3b) Software (operating system and application): Deterministic behavior of the software itself is usually not an issue; however, numerous uncertainties are often introduced by the operational environment. All problems with varying and asynchronous inputs that are known in the context of replica determinism apply here as well.
=> As a consequence, if we want to ensure reproducible conditions we soon end up with a synthetic environment.

As a conclusion of the above analysis we find that detailed reproducibility can be attained in practice, but the effort may be quite high. Whether the cost is balanced by the benefits depends on the particular situation.

We can further conclude that reproducibility ultimately implies many severe restrictions on the setup. One might even argue that we are performing our experiments in a synthetic, well-behaved world to make them reproducible. But do our results then allow any conclusions on the system behavior for the "real world" situations not covered by our assumptions? What about borderline cases like setup/hold violations? What about other environmental conditions and inputs?

At first glance these arguments give us a bad feeling about overly restrictive, well-behaved experimental assumptions. The question "Detailed reproducibility - yes or no?" thus turns out to be not only a matter of cost and benefit, but ends in the fundamental question "Is a realistic but uncontrolled experiment better than a synthetic but well-controlled one?" Since the assumption coverage, i.e. the correspondence between experimental setup and actual field operation, is a crucial issue for each fault injection study, we tend to prefer the realistic type of setup. Known sources of mismatch are the applied system model, the assumed operating environment and the fault hypothesis; the injection of real-world faults into a physical prototype therefore appears most promising.

In a well-controlled type of environment we must make numerous assumptions on setup parameters, and it is often hard to find realistic values, since we usually cannot specify the anticipated real-world faults in full detail. Why don't we have to make that many assumptions with a realistic setup? The answer is simple: We do make all of these assumptions in a realistic setup as well, but they are implicit and therefore invisible.

  • We do select a specific prototype system with actual threshold values and actual propagation delays,
  • we do operate it with an actual value of the supply voltage under actual environmental conditions,
  • we do apply a limited set of faults with actual parameters,
  • ...

So with our realistic setup we still pick out a subset from all possible experimental conditions. The main difference is that the assumptions are now implicit. This has the advantage that we have a fair chance that our setup meets the real-world conditions (unless our prototype system represents an extreme in some respect), and we may detect a coverage deficiency in an unanticipated constellation. This is an important advantage, since it is always the unexpected that leads to catastrophes. However, there is also a severe drawback: Our assumptions may be realistic, but this does not imply that they are representative. How can we be sure that we have sufficiently varied all relevant setup parameters if we do not even see them? When have we performed a sufficient number of experiments to have confidence in the results? Will we get different results with another physical target system of the same type? Moreover, the documentation of the experimental setup is difficult.

These problems finally remind us of the advantages of the well-controlled, reproducible setup, which gives us a feeling of clarity and completeness.

Apparently, the realistic setup is preferable if the purpose of the fault injection is fault tolerance prediction, while the benefits of detailed reproducibility are more prevalent for fault tolerance verification or optimization. But is there any reason to generally disregard the benefits of detailed reproducibility?


1. Author contact:
Department of Measurement Technology, Vienna University of Technology,
Gusshausstrasse 25, A-1040 Vienna, Austria.
Phone: +43-1-58801-3596, Fax: +43-1-5875998.
E-Mail: Andreas.Steininger@tuwien.ac.at.