Evaluating cost and effectiveness of software redundancy techniques for hardware errors detection

M. Rebaudengo, M. Sonza Reorda 1 
Politecnico di Torino
Dipartimento di Automatica e Informatica
Torino, Italy

 
 Abstract

    Duplicating variables in high-level code is an effective technique to detect hardware errors in memory and in processor registers. This paper reports an evaluation of the cost (in terms of memory and CPU time overhead) and effectiveness (in terms of error detection) of this simple software redundancy technique for hardware error detection. The analysis allows the designer to correctly choose the best trade-off for its application.  
     

    Introduction 

    Redundancy is a common answer to the increasing demand of high dependability in safety-critical applications. In this paper we consider low-cost computer-based systems; in such systems hardware redundancy is unacceptable and software redundancy is the only design modification which can possibly be introduced to improve the system fault tolerance. 

    Different software fault tolerant approaches have been proposed to address different kinds of error sources [TJPr96][RMSi96]. To achieve a high degree of fail-silent behavior in ordinary computers, the intrinsic Error Detection Mechanisms (EDMs) of the system (exceptions, memory protection, etc.) can be complemented with a set of carefully chosen software error detection techniques [RMSi96]. These techniques include Algorithm Based Fault Tolerance (ABFT) [HuAb84], Assertions, and Code Flow checking. The technique adopted here is based on variable duplication: as a main advantage, it can be automatically implemented on the high-level code of the program. The paper performs some analysis about the cost and effectiveness of this approach. 

    The FATO approach 

    This paper adopts the Software Error Detection approach called FATO [BCPR97], based on source code modification and data redundancy, that aims at increasing the robustness of the system with respect to hardware errors in the memory and registers. The main idea of the approach is to perform two different modifications to the source code; the first one corresponds to duplicating some or all of the program variables in order to introduce data redundancy and modifying all the operators to manage the introduced replica of the variables. The second one aims at introducing consistency checks inside the control flow to periodically verify the consistency between the two copies of each variable. This approach has been developed mainly to detect transient bit-flip faults in memory locations storing the data and in the microprocessor's user registers. 

    Trading-off fault coverage and CPU Time overhead 

    In some cases, safety-critical applications have strict constraints in terms of memory occupation and system performance. The duplication of the whole set of variables and the introduction of a consistency check before every read operation represent the optimum choice from the fault coverage point of view. We consider only faults appearing in the data memory during the life period of each variable. With this assumption, the duplication of the whole set of variables guarantees that 100% of faults is covered by this software redundancy technique. On the other side, by duplicating a lower percent of variables one can trade-off the obtained fault coverage with the CPU time overhead. 

    In this paper we propose an analysis of the code useful to choose a good trade-off between system overhead and fault coverage achieved with variable duplication and consistency checks for a low-cost safety critical computer-based system. An experimental analysis is currently being performed, considering a set of benchmark programs; the preliminary experimental results reported here refer to a representative program which emulates a railway control system in a small railway station with 2 railroads. 

    To avoid the duplication of the whole set of variables, we propose an analysis of the data trace of a fault-free execution of the target application. This allows to determine the life time of each variable (corresponding to the cumulative length of the longest time intervals between a write operation and the following read operations, not containing other write operations). Variables are ordered according to their life time, and those with the largest life time are duplicated. In fact, the larger the life time is, the higher the probability that an error affecting this variable occurs. We assume that all the faults in a duplicated variable are detected by the FATO approach. 

    Fig. 1 shows the CPU time overhead introduced in the benchmark program by duplicating a decreasing number of variables, selected according to the above strategy. The fault coverage achieved by duplicating a subset of variables ranging from 10% to 100% has been experimentally evaluated and reported in Fig. 2. The experimental results with our benchmark show that duplicating 50% of the variables is enough to guarantee a Fault Coverage of 85% of the faults with a reduced CPU time overhead (28%). 

    Chart Grafico1 

    Fig. 1: CPU time overhead vs. the percentage of duplicated variables.

    Chart Grafico1 

    Fig. 2: Fault Coverage vs. the percentage of duplicated variables.

    Another parameter that has to be carefully set to decrease the code size and the execution time overhead is the consistency check granularity. In particular, different solutions can be chosen in terms of where to introduce the consistency checks: 

    1. after each read operation;
    2. at the end of each life period.

    The number (provided that is greater than 1) and position of the consistency checks do not modify the achieved fault coverage: if the variable is duplicated, any fault appearing in its life period is detected. Introducing a check after each read operation is the safest choice, giving the minimum latency of a fault. On the other side, this solution gives a high overhead in terms of CPU time. Introducing a check at the end of each life period increases the latency with a reduced cost in terms of CPU time. A more precise analysis of the link between the number of introduced consistency checks and the latency is currently under analysis. 

    Conclusions  
     
    In this paper we report some data about the costs (in terms of CPU time) and advantages (in terms of Fault Coverage) of a previously proposed technique [BCPR97] for software redundancy to improve fault tolerance in low-cost computer-based safety-critical systems. The proposed approach is based on variable duplication and consistency check introduction. An exhaustive application of this approach could dramatically increase the resources (CPU time and memory) overhead; therefore, a balanced trade-off has to be defined between fault coverage and overhead. We propose to study the system fault-free behavior to decide the variables to be duplicated and checked, ordering the variables taking into account their life period. 

    Work is currently being done in optimizing the automatic modification of the source code in order to tune the number of variables to be duplicated and the granularity of the consistency checks to be inserted in the code, introducing more sophisticated techniques for code analysis. 

    References

    [BCPR97] A. Benso, F. Corno, P. Prinetto, M. Rebaudengo, M. Sonza Reorda, FATO: a software FAult TOlerant approach, 3th-IEEE International On-Line Testing Workshop, July 1997 [HuAb84] K. H. Huang, J. A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Trans. Computers, vol. 33, Dec 1984, pp. 518-528 
    [PVBW88] D. Powell, P. Verissimo, G. Bonn, F. Waeselynck, D. Seaton, The Delta-4 Approach to Dependability in Open Distributed Computing Systems, Proc. FTCS-18, Tokyo (J), 1988, pp. 246-251 
    [RMSi96] M. Zenha Rela, H. Madeira, J. G. Silva, Experimental Evaluation of the Fail-Silent Behavior in Programs with Consistency Checks, Proc. FTCS-26, Sendaj (J), 1996, pp. 394-403 
    [TJPr96] D. Todd Smith, T. A DeLong, B. W. Johnson, J. A. Profeta III, An Algorithm Based Fault Tolerant Technique for Safety Critical Applications, Reliability and Maintainability Symposium, Philadelphia, PA, 1997 


1. Contact Author:  Politecnico di Torino, Dipartimento di Automatica e Informatica,  Corso Duca degli Abruzzi 24, I-10129, Torino, Italy Phone: +39 11 564 7055. Fax: +39 11 564 7099  E-Mail: sonza@polito.it