Duplicating variables in high-level code is an effective technique
to detect hardware errors in memory and in processor registers. This paper
reports an evaluation of the cost (in terms of memory and CPU time overhead)
and effectiveness (in terms of error detection) of this simple software
redundancy technique for hardware error detection. The analysis allows
the designer to correctly choose the best trade-off for its application.
Introduction
Redundancy is a common answer to the increasing demand of high dependability
in safety-critical applications. In this paper we consider low-cost computer-based
systems; in such systems hardware redundancy is unacceptable and software
redundancy is the only design modification which can possibly be introduced
to improve the system fault tolerance.
Different software fault tolerant approaches have been proposed to address
different kinds of error sources [TJPr96][RMSi96]. To achieve a high degree
of fail-silent behavior in ordinary computers, the intrinsic Error Detection
Mechanisms (EDMs) of the system (exceptions, memory protection, etc.) can
be complemented with a set of carefully chosen software error detection
techniques [RMSi96]. These techniques include Algorithm Based Fault
Tolerance (ABFT) [HuAb84], Assertions, and Code Flow checking.
The technique adopted here is based on variable duplication: as a main
advantage, it can be automatically implemented on the high-level code of
the program. The paper performs some analysis about the cost and effectiveness
of this approach.
The FATO approach
This paper adopts the Software Error Detection approach called FATO
[BCPR97], based on source code modification and data redundancy,
that aims at increasing the robustness of the system with respect to hardware
errors in the memory and registers. The main idea of the approach is to
perform two different modifications to the source code; the first one corresponds
to duplicating some or all of the program variables in order
to introduce data redundancy and modifying all the operators to manage
the introduced replica of the variables. The second one aims at introducing
consistency checks inside the control flow to periodically verify
the consistency between the two copies of each variable. This approach
has been developed mainly to detect transient bit-flip faults in memory
locations storing the data and in the microprocessor's user registers.
Trading-off fault coverage and CPU Time overhead
In some cases, safety-critical applications have strict constraints
in terms of memory occupation and system performance. The duplication of
the whole set of variables and the introduction of a consistency check
before every read operation represent the optimum choice from the fault
coverage point of view. We consider only faults appearing in the data memory
during the life period of each variable. With this assumption, the duplication
of the whole set of variables guarantees that 100% of faults is covered
by this software redundancy technique. On the other side, by duplicating
a lower percent of variables one can trade-off the obtained fault coverage
with the CPU time overhead.
In this paper we propose an analysis of the code useful to choose a
good trade-off between system overhead and fault coverage achieved with
variable duplication and consistency checks for a low-cost safety critical
computer-based system. An experimental analysis is currently being performed,
considering a set of benchmark programs; the preliminary experimental results
reported here refer to a representative program which emulates a railway
control system in a small railway station with 2 railroads.
To avoid the duplication of the whole set of variables, we propose an
analysis of the data trace of a fault-free execution of the target application.
This allows to determine the life time of each variable (corresponding
to the cumulative length of the longest time intervals between a write
operation and the following read operations, not containing other write
operations). Variables are ordered according to their life time, and those
with the largest life time are duplicated. In fact, the larger the life
time is, the higher the probability that an error affecting this variable
occurs. We assume that all the faults in a duplicated variable are detected
by the FATO approach.
Fig. 1 shows the CPU time overhead introduced in the benchmark program
by duplicating a decreasing number of variables, selected according to
the above strategy. The fault coverage achieved by duplicating a subset
of variables ranging from 10% to 100% has been experimentally evaluated
and reported in Fig. 2. The experimental results with our benchmark show
that duplicating 50% of the variables is enough to guarantee a Fault Coverage
of 85% of the faults with a reduced CPU time overhead (28%).
Fig. 1: CPU time overhead vs. the percentage of duplicated variables.
Fig. 2: Fault Coverage vs. the percentage of duplicated variables.
Another parameter that has to be carefully set to decrease the code
size and the execution time overhead is the consistency check granularity.
In particular, different solutions can be chosen in terms of where to introduce
the consistency checks:
- after each read operation;
- at the end of each life period.
The number (provided that is greater than 1) and position of the consistency
checks do not modify the achieved fault coverage: if the variable is duplicated,
any fault appearing in its life period is detected. Introducing a check
after each read operation is the safest choice, giving the minimum latency
of a fault. On the other side, this solution gives a high overhead in terms
of CPU time. Introducing a check at the end of each life period increases
the latency with a reduced cost in terms of CPU time. A more precise analysis
of the link between the number of introduced consistency checks and the
latency is currently under analysis.
Conclusions
In this paper we report some data about the costs (in terms of CPU time)
and advantages (in terms of Fault Coverage) of a previously proposed technique
[BCPR97] for software redundancy to improve fault tolerance in low-cost
computer-based safety-critical systems. The proposed approach is based
on variable duplication and consistency check introduction. An exhaustive
application of this approach could dramatically increase the resources
(CPU time and memory) overhead; therefore, a balanced trade-off has to
be defined between fault coverage and overhead. We propose to study the
system fault-free behavior to decide the variables to be duplicated and
checked, ordering the variables taking into account their life period.
Work is currently being done in optimizing the automatic modification
of the source code in order to tune the number of variables to be duplicated
and the granularity of the consistency checks to be inserted in the code,
introducing more sophisticated techniques for code analysis.
References
[BCPR97] A. Benso, F. Corno, P. Prinetto, M. Rebaudengo,
M. Sonza Reorda, FATO: a software FAult TOlerant approach, 3th-IEEE International
On-Line Testing Workshop, July 1997 [HuAb84] K. H. Huang, J. A. Abraham,
Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Computers,
vol. 33, Dec 1984, pp. 518-528
[PVBW88] D. Powell, P. Verissimo, G. Bonn, F. Waeselynck, D. Seaton, The
Delta-4 Approach to Dependability in Open Distributed Computing Systems,
Proc. FTCS-18, Tokyo (J), 1988, pp. 246-251
[RMSi96] M. Zenha Rela, H. Madeira, J. G. Silva, Experimental Evaluation
of the Fail-Silent Behavior in Programs with Consistency Checks, Proc.
FTCS-26, Sendaj (J), 1996, pp. 394-403
[TJPr96] D. Todd Smith, T. A DeLong, B. W. Johnson, J. A. Profeta III,
An Algorithm Based Fault Tolerant Technique for Safety Critical Applications,
Reliability and Maintainability Symposium, Philadelphia, PA, 1997