Recovery languages: an effective structure for software fault tolerance

Vincenzo De Florio 1, Geert Deconinck 2, Rudy Lauwereins 3
Katholieke Universiteit Leuven
Electrical Engineering Department, ACCA Group

 

As early as 1975, Randell [1] raised the question of which system structure to adopt for software fault tolerance (SFT). Since that pioneering paper, a number of mechanisms, tools, and environments have been developed, each proposing its own method to meet the dependability requirements of the user application. Interest in software fault tolerance is far from fading: the industry council of FTCS selected SFT as one of the most important topics requiring further research. In general, all of these methods require the user code to be instrumented to some degree; some systems require few or no modifications, while others call for a thorough rearrangement of the original code.

A number of research activities have shown [2,3] that these modifications may have a deep impact on both observed performance and dependability. This happens because the user's knowledge of his/her own application makes it possible to get the most out of any fault tolerance method.

Unfortunately, this process can carry considerably high costs, due to the new portions of code to develop and to the increased complexity of the overall system. The resulting code often appears as one monolithic software package in which the original task and the fault tolerance method are so entangled that it becomes extremely difficult to maintain, port, or even modify either the application or the fault tolerance strategy. This is the case, for example, for a large number of methods based on software redundancy or diversity, such as recovery blocks [1] or N-version programming [4]. All the implementations of these methods we know of work at application level, embedding the method into the user application. For instance, the HATS system [5] provides, among other things, C-style constructs for recovery blocks, N-version programming, exception handling, and retry blocks.

We believe that a good system structure for SFT is one that allows the user to address the aspects of fault-tolerant computing separately from those pertaining to the behaviour that his/her application should have in the absence of faults. Such a structure would let the user tackle either front more easily and effectively, for instance by modifying the fault tolerance strategy with few or no modifications to the application part, or vice versa.
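To make the entanglement concrete, the following is a minimal Python sketch of the recovery block scheme of [1], written at application level. It is not the HATS interface: the names `recovery_block`, `faulty_sort`, `simple_sort`, and `accept` are hypothetical and chosen only for illustration. Note how the functional task (sorting) cannot be read or changed without also touching the alternates and the acceptance test.

```python
from copy import deepcopy
from typing import Callable, Sequence


class RecoveryBlockFailure(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(state: dict,
                   acceptance_test: Callable[[dict], bool],
                   alternates: Sequence[Callable[[dict], None]]) -> dict:
    """Try each alternate in order, each time starting from the checkpoint."""
    checkpoint = deepcopy(state)           # recovery point taken on entry
    for alternate in alternates:
        candidate = deepcopy(checkpoint)   # roll back before every attempt
        try:
            alternate(candidate)
        except Exception:
            continue                       # a crash counts as a failed attempt
        if acceptance_test(candidate):
            return candidate
    raise RecoveryBlockFailure("all alternates were rejected")


# Application code (hypothetical): task and fault tolerance side by side.
def faulty_sort(state: dict) -> None:
    state["data"] = state["data"][:-1]     # deliberately buggy: drops an item


def simple_sort(state: dict) -> None:
    state["data"] = sorted(state["data"])


def accept(state: dict) -> bool:
    data = state["data"]
    ordered = all(a <= b for a, b in zip(data, data[1:]))
    return ordered and len(data) == state["expected_len"]


result = recovery_block({"data": [3, 1, 2], "expected_len": 3},
                        accept, [faulty_sort, simple_sort])
```

Here the primary alternate is rejected by the acceptance test and the second one is accepted. Changing the strategy (say, from recovery blocks to retry) would mean editing this same application code.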

We propose a novel system structure for SFT in which only a small number of changes need to be made to the original user application, while the bulk of the fault tolerance strategy is carried out by a virtual machine interpreting a separately defined, user-provided ``recovery script''. This script is written in what we call a ``Recovery Language'' (RL), in which the user expresses his/her fault tolerance strategies.

Via such a script, the user refers to high-level entities in his/her application, such as tasks, threads, logical groups of threads, and processing nodes, and queries their status as far as fault tolerance is concerned, e.g., checking whether a particular entity is currently regarded as faulty, has ever been rebooted, or is currently loaded on a particular node. Special high-level actions can also be specified, so as to relocate a particular entity, or to restart or terminate it. In this scheme, the only instrumentation required in the user application is the registration of status information and events in a system database (e.g., the fact that thread 12 belongs to a logical group of threads known as ``group 2'', that it is currently in its initialization phase, and that no error affecting that thread has been detected so far). It is the virtual machine that has access to that information and executes a global strategy, which comes into action the moment the first error is detected.
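The kind of status information described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class `SystemDatabase`, its method names, and the record fields are hypothetical and do not reproduce the EFTOS interface. The first group of methods stands for the instrumentation added to the user application; the second group stands for the queries a recovery script would issue.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EntityRecord:
    """Status of one application entity as seen by the recovery layer."""
    kind: str                      # "task", "thread", "node", ...
    node: Optional[str] = None     # node on which the entity is loaded
    group: Optional[str] = None    # logical group, if any
    phase: str = "unknown"         # e.g. "initialization"
    faulty: bool = False           # set when an error is detected
    reboots: int = 0               # how many times it has been rebooted


class SystemDatabase:
    """Hypothetical in-memory stand-in for the system database."""

    def __init__(self) -> None:
        self._records: Dict[str, EntityRecord] = {}

    # Calls made by the instrumented user application.
    def register(self, name: str, kind: str, node: Optional[str] = None,
                 group: Optional[str] = None) -> None:
        self._records[name] = EntityRecord(kind=kind, node=node, group=group)

    def set_phase(self, name: str, phase: str) -> None:
        self._records[name].phase = phase

    def report_error(self, name: str) -> None:
        self._records[name].faulty = True

    # Queries made on behalf of the recovery script.
    def is_faulty(self, name: str) -> bool:
        return self._records[name].faulty

    def has_rebooted(self, name: str) -> bool:
        return self._records[name].reboots > 0

    def is_on_node(self, name: str, node: str) -> bool:
        return self._records[name].node == node

    def group_members(self, group: str) -> List[str]:
        return sorted(n for n, r in self._records.items() if r.group == group)


# The example from the text: thread 12, member of "group 2", initializing.
db = SystemDatabase()
db.register("thread12", kind="thread", node="node0", group="group2")
db.set_phase("thread12", "initialization")
```

The application only states facts about itself; nothing in these calls commits it to a particular recovery strategy.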

We claim that this recovery language approach fulfils the requirements we posed for the ideal system structure for SFT. The generality and validity of our approach are further increased when one considers that it places no limit on the structure of the recovery language to adopt or on the complexity of the virtual machine to set up. For instance, neither the interface with any special fault tolerance resource, such as an augmented operating system or custom hardware, nor any implementation aspect (e.g., a centralized vs. a distributed system database, one or more virtual machines, etc.) is part of this view. The difficulties and costs of porting and maintaining the fault tolerance section of the user application would mainly concern the virtual machine, and as such could be handled by a fault tolerance specialist instead of an application specialist.

Prototype versions of a recovery language and of its virtual machine have been developed in the framework of the ESPRIT project EFTOS [6]. Figure 1 briefly sketches our approach.

 

Figure 1: A global view of an RL program: the user supplies an RL source code; the rl translator turns it into binary r-codes; the r-codes are interpreted at run time by the RINT virtual machine, which, among other tasks, manages the system database.


In our implementation, an RL specification is a collection of recovery actions guarded by selective conditions. Recovery actions include rebooting or shutting down a node, killing or restarting a thread or a logical group of threads, sending warnings to single or grouped threads, and purging error records from the system database.

The scheme works as follows: the user writes a recovery strategy in RL, viz., a specification of the actions to be taken to tackle each particular error condition when that error is detected. Whenever an error is detected, the RINT thread is awakened; it interprets the r-code equivalent of the RL source, looking for fulfilled conditions and possibly accessing the system database. Once a condition is met, the corresponding actions, which are supposed to be able to tackle the error, are executed. A default action is also available in case no condition evaluates to true.
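The control flow just described (guarded actions, evaluated when an error is detected, with a default fallback) can be sketched in a few lines of Python. This is not RINT and does not use r-codes: rules are plain Python callables, the database is a dictionary, and all names (`on_error`, `kill`, `warn`, `is_faulty`) are hypothetical. The sketch also assumes that every rule whose guard holds is executed; the text does not say whether interpretation stops at the first match.

```python
from typing import Callable, Dict, List, Tuple

Database = Dict[str, Dict[str, bool]]
Condition = Callable[[Database], bool]
Action = Callable[[Database], str]
Rule = Tuple[Condition, List[Action]]


def is_faulty(name: str) -> Condition:
    """Guard: true when the named entity is marked faulty."""
    return lambda db: db[name]["faulty"]


def kill(name: str) -> Action:
    def action(db: Database) -> str:
        db[name]["alive"] = False          # record the effect in the database
        return f"KILL {name}"
    return action


def warn(name: str) -> Action:
    def action(db: Database) -> str:
        return f"WARN {name}"
    return action


def on_error(rules: List[Rule], db: Database, default: Action) -> List[str]:
    """Called once per detected error: run the actions of fulfilled guards."""
    executed: List[str] = []
    fired = False
    for condition, actions in rules:
        if condition(db):
            fired = True
            executed.extend(action(db) for action in actions)
    if not fired:
        executed.append(default(db))       # no guard held: fall back
    return executed


rules: List[Rule] = [(is_faulty("thread1"), [kill("thread1"), warn("thread2")])]


def default_action(db: Database) -> str:
    return "DEFAULT"


faulty_db: Database = {"thread1": {"faulty": True, "alive": True},
                       "thread2": {"faulty": False, "alive": True}}
healthy_db: Database = {"thread1": {"faulty": False, "alive": True},
                        "thread2": {"faulty": False, "alive": True}}

log_faulty = on_error(rules, faulty_db, default_action)
log_healthy = on_error(rules, healthy_db, default_action)
```

With `faulty_db` the single rule fires and both of its actions run; with `healthy_db` no guard holds and only the default action is taken.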

With RL the user can express strategies aimed, for instance, at substituting a suspected task with a non-faulty deputy running elsewhere in the system (see Table 1), or at a graceful degradation of the user application (e.g., by killing those entities that have been detected as faulty; see Table 2).


    IF ...
        ... THREAD4 AND WARN THREAD2, THREAD3
    FI

Table 1: A recovery rule coded in RL. We suppose a TMR consisting of three threads, identified by integers 1-3. If the first thread is detected as faulty or its state is VFP_FAILURE, then that thread is killed, a new thread is started, and the other components are alerted so that they restore a non-faulty TMR. Three such rules may be used to set up, e.g., a three-and-one-spare system.

Table 2: Another recovery rule coded in RL. We suppose an NMR system consisting of N threads, collectively identified as ``group 1.'' The IF statement checks whether any element of the group has been detected as faulty or is currently in VFP_FAILURE state. If so, those that fulfill the condition (identified in RL as THREAD@) are killed, while those that do not (in RL, THREAD) are warned.


As previously claimed, the two rules above are different fault tolerance strategies that can easily be applied to the same user application. Any other strategy is up to the user, and adopting it requires no further modification of the application. Furthermore, the only kind of instrumentation required does not pertain to a particular strategy; on the contrary, it is only meant to register some entities of the user application and to configure some basic error detection tools, such as watchdog timers and trap handlers. A handful of these tools have been made available within the EFTOS framework.
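The point that two strategies can be applied to the same registered application state can be illustrated with a short Python sketch. It paraphrases the behaviour described in the captions of Tables 1 and 2 and is not RL code: the functions `tmr_with_spare` and `graceful_degradation`, the dictionary layout, and the action tuples are all assumptions made for this example, and the VFP_FAILURE state is left out for brevity.

```python
from typing import Dict, List, Tuple

Database = Dict[int, Dict[str, object]]
ActionLog = List[Tuple[str, int]]


def make_db() -> Database:
    """Same registered state for both strategies: threads 1-3 in group 1."""
    return {
        1: {"group": 1, "faulty": True, "alive": True},
        2: {"group": 1, "faulty": False, "alive": True},
        3: {"group": 1, "faulty": False, "alive": True},
    }


def tmr_with_spare(db: Database, suspect: int = 1, spare: int = 4) -> ActionLog:
    """Table 1 style: replace the suspect with a spare, warn the others."""
    log: ActionLog = []
    if db[suspect]["faulty"]:
        db[suspect]["alive"] = False
        log.append(("KILL", suspect))
        db[spare] = {"group": db[suspect]["group"],
                     "faulty": False, "alive": True}
        log.append(("START", spare))
        for tid in sorted(db):
            if tid not in (suspect, spare) and db[tid]["alive"]:
                log.append(("WARN", tid))
    return log


def graceful_degradation(db: Database, group: int = 1) -> ActionLog:
    """Table 2 style: kill the faulty members of a group, warn the rest."""
    members = [tid for tid in sorted(db)
               if db[tid]["group"] == group and db[tid]["alive"]]
    bad = [tid for tid in members if db[tid]["faulty"]]
    log: ActionLog = []
    if bad:
        for tid in bad:
            db[tid]["alive"] = False
            log.append(("KILL", tid))
        for tid in members:
            if tid not in bad:
                log.append(("WARN", tid))
    return log


spare_log = tmr_with_spare(make_db())
degrade_log = graceful_degradation(make_db())
```

Both functions read the same database contents produced by `make_db`; only the strategy differs, which mirrors the separation between application and recovery script argued for in the text.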

References

[1] B. Randell, ``System Structure for Software Fault Tolerance,'' IEEE TSE, vol. SE-1, pp. 220-232, June 1975.

[2] J. H. Saltzer, D. P. Reed, and D. D. Clark, ``End-to-end arguments in system design,'' ACM Trans. on Comp. Systems, 2(4): 277-288, 1984.

[3] D. P. Siewiorek and R. S. Swarz, ``Reliable Computer Systems: Design and Evaluation,'' Digital Press, 1992.

[4] A. Avizienis, ``The N-version approach to Fault-Tolerant Software,'' IEEE TSE, vol. SE-11, pp. 1491-1501, Dec. 1985.

[5] Y. Huang and C. M. R. Kintala, ``Software Implemented Fault Tolerance: Technologies and Experience,'' Proc. of FTCS-23, Toulouse, France, pp. 2-9, June 1993.

[6] G. Deconinck et al., ``Industrial Embedded HPC Applications,'' Int. Journal Supercomputer (ASFRA BV, Edam, The Netherlands) 69 (Vol. XIII, No. 3/4), 1997, pp. 23-44.

1. Author contact: Katholieke Universiteit Leuven, Electrical Engineering Dept., ACCA group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium. Phone: +32-16-32-1142. Fax: +32-16-32-1986. E-Mail: vincenzo.deflorio@esat.kuleuven.ac.be.
2. Author contact: Katholieke Universiteit Leuven, Electrical Engineering Dept., ACCA group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium. Phone: +32-16-32-1126. Fax: +32-16-32-1813. E-Mail: geert.deconinck@esat.kuleuven.ac.be.
3. Author contact: Katholieke Universiteit Leuven, Electrical Engineering Dept., ACCA group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium. Phone: +32-16-32-1035. Fax: +32-16-32-1813. E-Mail: rudy.lauwereins@esat.kuleuven.ac.be.