How Reproducible should Fault Injection Experiments be?
Andreas Steininger
Vienna University of Technology, Department of Measurement Technology
A-1040 Vienna, Austria
Fault injection is the classical complement to formal methods in the
area of fault tolerance assessment. It is applied to unveil fault tolerance
deficiencies with the aims of (a) optimizing or verifying the fault tolerance
mechanisms during the design phase and (b) predicting the fault tolerance
behavior of an existing system in its application.
Reproducibility is a vital property of a fault injection experiment (as
of any other scientific experiment). Since the common fault-tolerance ratings
are derived by means of statistics, two levels of reproducibility exist:
First, repeating the whole experiment series must result in almost identical
statistical ratings, which is a necessary proof for a meaningful definition
and proper statistical coverage of the fault/activity space. Second, performing
a single fault injection twice may produce the identical result in detail,
if setup and injection method are carefully chosen. This high degree of
reproducibility is not a necessary condition for the usefulness of the results
and is therefore often disregarded in practical setups. The benefits of such
detailed reproducibility, however, are impressive:
- In a typical experimental series a huge number of faults are injected
to achieve a high coverage of the fault/activity space. The observation
of the system reaction must often be limited to fail/no-fail information,
as recording detailed behavioral information such as control flow, bus
timing etc. is ruled out by memory limitations. Of these numerous injections
(hopefully) only a few will actually unveil fault tolerance deficiencies.
The fault-tolerance optimization process is substantially simplified if
such single alarming results can be reproduced and more comprehensive
data records can be collected to support a detailed analysis.
- The same is true for doubtful single results: Experimental readouts
that are surprising in some respect can be verified or possibly traced
to setup deficiencies.
- A direct comparison of different fault-tolerance strategies can be
performed with identical fault sets. Statistical uncertainties do not apply.
- Similarly, the contribution of a single error detection mechanism (EDM)
to overall coverage can be directly assessed by comparison of a setup with
and without the EDM. Without identical fault sets this is normally a problem,
since a dominant EDM may mask the subsequent activation of others.
- The effectiveness of optimization measures that have been taken to
eliminate fault tolerance deficiencies becomes directly observable.
- Exactly reproducible system reaction to a given fault set provides
an ideal foundation for certification and conformance testing.
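The benefits listed above can be illustrated with a minimal Python sketch of
a software-implemented injection campaign. Everything in it is a hypothetical
stand-in rather than a real target: the `Fault` descriptor, the toy
accumulator in `run_target`, and the two error detection mechanisms (a
per-step parity check and a final range check) are assumptions made purely
for illustration. The campaign stores only a coarse outcome per fault, the
same fault set is applied with and without the parity EDM, and one alarming
injection is then replayed with detailed tracing switched on.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    step: int  # loop iteration (stand-in for a clock cycle) of the injection
    bit: int   # bit of the 8-bit accumulator to flip


def run_target(fault, parity_edm, trace=None):
    """Toy target: sums 0..7 in an 8-bit accumulator and returns the outcome."""
    acc = 0
    for step in range(8):
        acc = (acc + step) & 0xFF
        parity = bin(acc).count("1") & 1       # reference parity before the fault
        if fault is not None and fault.step == step:
            acc ^= 1 << fault.bit              # deterministic single bit flip
        if trace is not None:
            trace.append((step, acc))          # detailed record, only on request
        if parity_edm and (bin(acc).count("1") & 1) != parity:
            return "detected_parity"
    if acc > 28:                               # second EDM: range check on the result
        return "detected_range"
    return "ok" if acc == 28 else "fail"


def campaign(faults, parity_edm):
    """Inject every fault once and keep only the coarse outcome per fault."""
    return {f: run_target(f, parity_edm) for f in faults}


# Identical, fully specified fault set for both configurations.
faults = [Fault(s, b) for s in range(8) for b in range(8)]

with_parity = campaign(faults, parity_edm=True)
without_parity = campaign(faults, parity_edm=False)
print(Counter(with_parity.values()))
print(Counter(without_parity.values()))

# Replay one alarming injection, this time collecting the full trace.
alarming = next(f for f, outcome in without_parity.items() if outcome == "fail")
trace = []
assert run_target(alarming, parity_edm=False, trace=trace) == "fail"
print(alarming, trace)
```

Because every injection is fully specified by its descriptor, both
configurations see exactly the same 64 faults, so the difference between the
two outcome counts is attributable to the parity check alone. With the parity
check enabled it catches every flip before the range check is reached, which
is the masking effect mentioned above.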
Detailed reproducibility implies exact controllability of the fault injection,
which has several further benefits:
- It is possible to directly provoke a worst-case scenario.
- A high degree of fault activation can be achieved, simply by avoiding
ineffective injections.
- Detailed knowledge of the characteristics of the injected fault provides
ideal support for stratified sampling techniques. We have clear indications
of whether we have sufficiently varied all relevant parameters.
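As a rough illustration of the stratified sampling mentioned in the last
item, the following Python sketch draws the same number of faults from every
stratum of a fault space. The fault space (location, injection cycle), the
stratum boundaries, and the function names are hypothetical choices for this
example; a fixed seed is assumed so that the resulting fault set can be
regenerated exactly.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(fault_space, stratum_of, per_stratum, seed):
    """Draw up to `per_stratum` faults from every stratum, reproducibly."""
    rng = random.Random(seed)          # fixed seed: same fault set on every run
    strata = defaultdict(list)
    for fault in fault_space:
        strata[stratum_of(fault)].append(fault)
    sample = []
    for key in sorted(strata):         # fixed order keeps the draws reproducible
        members = strata[key]
        sample.extend(rng.sample(members, min(per_stratum, len(members))))
    return sample


# Hypothetical fault space: every (location, injection cycle) combination.
fault_space = [(loc, cycle)
               for loc in ("alu", "bus", "regfile")
               for cycle in range(100)]


def stratum_of(fault):
    """Assumed stratification: location crossed with early/late injection."""
    loc, cycle = fault
    return (loc, "early" if cycle < 50 else "late")


sample = stratified_sample(fault_space, stratum_of, per_stratum=5, seed=1)
print(Counter(stratum_of(f) for f in sample))
```

Counting the sample per stratum, as in the last line, gives the kind of
explicit indication of parameter variation that the text refers to.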
Although these benefits sound tempting, detailed reproducibility is hard
to achieve and imposes many restrictions on the experimental setup. Generally,
three conditions must be fulfilled:
(1) The injected fault must be reproducible (including its manifestation
on the system).
(2) The timing relation between injected fault and target system must be
controllable.
(3) The behavior of the target system must be reproducible.
The timing resolution under which these conditions must be met is determined
by the system clock period and the respective setup/hold windows. Let us
now investigate the implications of these conditions:
Condition (1) - reproducibility of the fault
While it is generally not a problem to exactly reproduce a fault in a simulation
environment or by software-implemented fault injection, reproducibility
is extremely hard to attain for physical fault injection: The statistical
nature of heavy ion radiation and the high sensitivity of the effects of
EMI to the slightest variations in topology or environmental conditions (temperature,
humidity) are prominent examples of the lack of controllability of physical
fault injection. The situation is equally bad for disturbances of the power
supply where current peaks, parasitic inductance, bypass capacitors, varying
thresholds etc. introduce temporal uncertainties far beyond one processor
clock period. Methods like pin forcing or insertion facilitate better controllability,
but maintaining a precise fault pulse with constant shape in a 50 MHz system
is still a challenge.
=> In summary, the demand for reproducibility severely restricts the
choice of applicable fault injection techniques, particularly with respect
to physical-level fault injection.
Condition (2) - controllable timing relation
The necessary timing relation between system operation and fault occurrence
can be established by synchronizing the fault injection (a) to system operation
by means of triggering and (b) to system clock. Additionally, any temporal
jitter of the fault injection must be minimized.
=> As above, this condition is quite easily fulfilled with simulation-based
or software-implemented fault injection, while hardware-based methods are
difficult or even impossible to apply.
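The effect of condition (2) can be made concrete with a small simulation
sketch in Python. The clocked target below (an 8-bit shift register with
feedback), the trigger on a cycle count, and the `jitter` parameter are all
hypothetical; in a real setup jitter is uncontrolled, whereas here it is
modeled as an explicit offset in whole clock cycles so that its influence on
the outcome becomes visible.

```python
def run(trigger_cycle, delay, jitter=0, cycles=64):
    """Toy clocked target with a triggered single-bit injection.

    The injector arms when the cycle counter reaches `trigger_cycle` and flips
    bit 0 of the state `delay + jitter` clock cycles later.
    """
    state = 0x5A
    inject_at = None
    for cycle in range(cycles):
        if cycle == trigger_cycle:                 # (a) trigger on system operation
            inject_at = cycle + delay + jitter     # (b) counted in whole clock cycles
        if cycle == inject_at:
            state ^= 0x01                          # the injected fault
        # One clock step of the target: shift left with XOR feedback.
        feedback = ((state >> 7) ^ (state >> 5)) & 1
        state = ((state << 1) | feedback) & 0xFF
    return state


# Same trigger, same delay, no jitter: every repetition ends in the same state.
nominal = [run(trigger_cycle=10, delay=3) for _ in range(5)]

# Same trigger and delay, but the injection slips by 0..3 clock cycles.
jittered = [run(trigger_cycle=10, delay=3, jitter=j) for j in range(4)]

print(len(set(nominal)), len(set(jittered)))
```

In this sketch the jitter-free repetitions agree exactly, while a slip of even
one clock cycle leads to a different final state, which is why the timing
resolution has to be on the order of the clock period.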
Condition (3) - reproducible behavior of the target system
(3a) Hardware: Although hardware operation appears strictly
deterministic from a macroscopic point of view, there are many sources of
uncertainty on a microscopic (clock related) level, such as interrupts,
asynchronous communication, hard disk access, random time-out after bus
access conflicts, caching or DRAM refresh. Even second-order effects like
clock jitter or variations of threshold with temperature and power supply
may impede reproducibility.
=> Reproducibility of the hardware operation can be maintained with an
extremely careful setup. However, some restrictions on the design and on
the operational modes during the experiments must be made.
(3b) Software (operating system and application): Deterministic
behavior of software is usually not an issue; however, numerous uncertainties
are often introduced by the operational environment. All problems with
varying and asynchronous inputs that are known in the context of replica determinism
apply here as well.
=> As a consequence, if we want to ensure reproducible conditions we
soon end up with a synthetic environment.
From the above analysis we conclude that detailed reproducibility
can be attained in practice, but the effort may be quite high. Whether
the cost is balanced by the benefits depends on the particular situation.
We can further conclude that reproducibility ultimately implies many
severe restrictions on the setup. One might even argue that we are performing
our experiments in a synthetic, well-behaved world to make them reproducible.
But do our results then allow any conclusions on the system behavior for
the "real world" situations not covered by our assumptions? What
about borderline cases like setup/hold violations? What about other environmental
conditions and inputs?
At first glance these arguments give us a bad feeling about overly restrictive,
well-behaved experimental assumptions, so finally the question "Detailed
reproducibility - yes or no?" turns out to be not only a matter of
cost and benefit, but ends in the fundamental question "Is a realistic
but uncontrolled experiment better than a synthetic but well-controlled
one?" Since the assumption coverage, the correspondence between experimental
setup and actual field operation, is a crucial issue for each fault injection
study, we tend to prefer the realistic type of setup. Known sources of mismatch
are the applied system model, the assumed operating environment and the
fault hypothesis; therefore, the injection of real-world faults into a physical
prototype appears most promising.
In a well-controlled type of environment we must make numerous assumptions
on setup parameters and it is often hard to find realistic values, since
we usually cannot specify the anticipated real-world faults in full detail.
Why don't we have to make that many assumptions with a realistic setup?
The answer is simple: We do make all of these assumptions in a realistic
setup as well, but they are implicit and therefore invisible.
- We do select a specific prototype system with actual threshold values
and actual propagation delays,
- we do operate it with an actual value of the supply voltage under actual
environmental conditions,
- we do apply a limited set of faults with actual parameters,
- ...
So with our realistic setup we still pick out a subset from all possible
experimental conditions. The main difference is that the assumptions are
now implicit. This has the advantage that we have a fair chance that our
setup meets the real-world conditions (unless our prototype system represents
an extreme in some respect), and we may detect a coverage deficiency in
an unanticipated constellation. This is an important advantage, since it
is always the unexpected that leads to catastrophes. However, there is
also a severe drawback: Our assumptions may be realistic, but this does
not imply that they are representative. How can we be sure that we have
sufficiently varied all relevant setup parameters if we do not even see
them? When have we performed a sufficient number of experiments to have
confidence in the results? Will we get different results with another physical
target system of the same type? Moreover, the documentation of the experimental
setup is difficult.
These problems finally remind us of the advantages of the well-controlled,
reproducible setup that gives us the feeling of clarity and completeness.
Apparently, the realistic setup is preferable if the purpose of the
fault injection is fault tolerance prediction, while the benefits of detailed
reproducibility are more prevalent for fault tolerance verification or optimization.
But is there any reason to generally disregard the benefits of detailed
reproducibility?