Recovery languages: an effective structure for software fault tolerance
Vincenzo De Florio 1, Geert Deconinck 2, Rudy Lauwereins 3
Katholieke Universiteit Leuven
Electrical Engineering Department, ACCA Group
As early as 1975, Randell [1]
raised the question of which system structure to adopt for software
fault tolerance (SFT). Since that pioneering paper, a number of mechanisms,
tools, and environments have been developed, each proposing its own method
for meeting the dependability requirements of the user application.
Interest in software fault tolerance is far from exhausted, as shown by
the selection of SFT by the industry council of FTCS as one of the most
important topics on which further research is necessary.
In general, all of these methods require the user code to be instrumented
to some extent: some systems require few or no modifications, while others
call for a thorough rearrangement of the original code.
A number of research activities have shown [2,3] that these modifications
may have a deep impact on both observed performance and dependability.
This happens because the user's knowledge of his/her own application makes
it possible to get the most out of any fault tolerance method.
Unfortunately, this process can carry considerable costs, due to the new
portions of code to be developed and to the increased complexity of the
overall system. The resulting code often appears as one monolithic software
package in which the original task and the fault tolerance method are so
entangled that it becomes extremely difficult to maintain, port, or even
modify either the application or the fault tolerance strategy. This is the
case, for example, for a large number of methods based on software
redundancy or diversity, such as recovery blocks [1] or N-version
programming [4]. All the implementations of these methods that we know of
work at application level, embedding the method into the user application.
For instance, the HATS system [5] provides, among others, C-style
constructs for recovery blocks, N-version programming, exception
handling, and retry blocks. We believe that a good system structure
for SFT is one that allows the user to address the fault tolerance
aspects separately from those pertaining to the behaviour that his/her
application should have in the absence of faults. Such a structure
would allow him/her to tackle either of these two fronts more easily and
effectively, for instance to modify the fault tolerance strategy with
few or no modifications to the application part, or vice versa.
We propose a novel system structure for SFT in which only a small number
of changes need to be made to the original user application, while the
bulk of the fault tolerance strategy is carried out by a virtual machine
interpreting a separately defined, user-provided ``recovery script''. This
script is written in what we call a ``Recovery Language'' (RL), which
allows the user to express his/her fault tolerance strategies.
Via such a script, the user refers to high-level entities in his/her
application, such as tasks, threads, logical groups of threads, and
processing nodes, and queries their status as far as fault tolerance is
concerned, e.g., checking whether a particular entity is currently regarded
as faulty, has ever been rebooted, is currently loaded on a particular
node, and so on. Special high-level actions can also be specified, so as
to relocate a particular entity, or to restart or terminate it. In this
scheme, the only instrumentation required in the user application is the
registration of status information and events in a system database (e.g.,
the fact that thread 12 belongs to a logical group of threads known as
``group 2'', that it is currently in its initialization phase, and that no
error affecting that thread has been detected so far). It is the virtual
machine that has access to that information and executes a global strategy,
which comes into action the moment the first error is detected.
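As an illustration only, the registration step described above might be
sketched in Python as follows. All class, method, and field names are
hypothetical and are not the EFTOS interface; the sketch merely assumes
that the system database keeps one record per thread and answers the kind
of status queries mentioned in the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class Phase(Enum):
    """Hypothetical execution phases a thread may report."""
    INITIALIZING = "initializing"
    RUNNING = "running"


@dataclass
class ThreadRecord:
    """Status information kept for one registered thread (assumed layout)."""
    thread_id: int
    group: int
    node: int
    phase: Phase = Phase.INITIALIZING
    faulty: bool = False
    reboots: int = 0


@dataclass
class SystemDatabase:
    """Toy in-memory stand-in for the system database."""
    records: dict = field(default_factory=dict)

    def register(self, thread_id, group, node):
        # Called by the instrumented application when a thread starts.
        self.records[thread_id] = ThreadRecord(thread_id, group, node)

    def set_phase(self, thread_id, phase):
        self.records[thread_id].phase = phase

    def report_error(self, thread_id):
        # Called by an error detection tool, not by the recovery script.
        self.records[thread_id].faulty = True

    def is_faulty(self, thread_id):
        return self.records[thread_id].faulty

    def group_members(self, group):
        return sorted(t for t, r in self.records.items() if r.group == group)


# The example from the text: thread 12 belongs to group 2, is in its
# initialization phase, and has no detected error.
db = SystemDatabase()
db.register(thread_id=12, group=2, node=0)
```

The application only registers facts; deciding what to do about them is
left entirely to the recovery script and its interpreter.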
We claim that this recovery language approach fulfils the requirements we
posed for the ideal system structure for SFT. The generality and validity
of our approach are further increased when one considers that no limit
is placed on the structure of the recovery language to adopt or on the
complexity of the virtual machine to set up. For instance, neither the
interface with any special fault tolerance resource, such as an augmented
operating system or custom hardware, nor any implementation aspect is part
of this view (e.g., a centralized vs. a distributed system database, one
or more virtual machines, etc.). The difficulties and costs of porting and
maintaining the fault tolerance section of the user application would
mainly concern the virtual machine, and as such could be handled by a
fault tolerance specialist instead of an application specialist.
A prototype version of a recovery language and of its virtual machine
has been developed in the framework of the ESPRIT project EFTOS [6].
Figure 1 briefly sketches our approach.
Figure 1: A global view of an RL program: the user supplies
an RL source code; the rl translator turns it into binary r-codes;
the r-codes are interpreted at run time by the RINT virtual machine, which,
among other tasks, manages the system database.
In our implementation, an RL specification is a collection of recovery
actions guarded by selective conditions. Recovery actions include rebooting
or shutting down a node, killing or restarting a thread or a logical group
of threads, sending warnings to single or grouped threads, and purging
error records from the system database.
The scheme works as follows: the user writes a recovery strategy in RL,
viz., a specification of the actions to be taken to tackle each particular
error condition when that error is detected. For each error, the RINT
thread is awakened; it interprets the r-code equivalent of the RL source,
looking for fulfilled conditions and possibly accessing the system
database. Once a condition is met, the corresponding actions, which are
supposed to be able to tackle the error, are executed. A default action is
also available in case no condition evaluates to true.
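The interpretation cycle just described can be sketched in Python as a
loop over guarded actions. This is a minimal sketch under stated
assumptions, not the RINT implementation: the function name, the rule
representation as (guard, actions) pairs, and the choice to evaluate every
rule rather than stop at the first match are all hypothetical.

```python
def interpret(rules, db, default_action):
    """Run the actions of every rule whose guard holds for db.

    rules is a list of (guard, actions) pairs. If no guard holds, the
    default action runs instead. Returns True if at least one rule fired.
    """
    fired = False
    for guard, actions in rules:
        if guard(db):
            fired = True
            for action in actions:
                action(db)
    if not fired:
        default_action(db)
    return fired


# Toy system database: thread id -> True if the thread is flagged faulty.
db = {1: True, 2: False, 3: False}
log = []  # records the recovery actions instead of performing them

rules = [
    (lambda d: d[1], [lambda d: log.append("kill 1"),
                      lambda d: log.append("start spare for 1")]),
    (lambda d: d[2], [lambda d: log.append("kill 2")]),
]

fired = interpret(rules, db, lambda d: log.append("default"))
# log is now ["kill 1", "start spare for 1"]
```

Keeping the rules as data, separate from the application code, is what
lets the strategy be replaced without touching the application.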
With RL the user can express strategies aimed, for instance, at substituting
a suspected task with a non-faulty deputy running elsewhere in the system
(see Table 1),
or at a graceful degradation of the user application (e.g., by killing
those entities that have been detected as faulty; see Table 2).
Table 1: A recovery rule coded in RL. We suppose a TMR (triple modular
redundancy) system consisting of three threads, identified by integers 1-3.
If the first thread is detected as faulty or its state is VFP_FAILURE, then
that thread is killed, a new thread is started, and the other components
are alerted so that they restore a non-faulty TMR. Three such rules may be
used to set up, e.g., a three-and-one-spare system.
Table 2: Another recovery rule coded in RL. We suppose
an NMR system consisting of N threads, collectively identified as
``group 1.'' The IF statement checks whether any element of the group
has been detected as faulty or is currently in VFP_FAILURE state. If so,
those that fulfil the condition (identified in RL as THREAD@) are killed,
while those that do not (in RL, THREAD ) are warned.
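The behaviour described in the caption of Table 2 can be paraphrased in
Python as follows. This is not RL syntax and not the authors' code: the
function name, the dictionary layout, and the "OK" state are hypothetical,
and only the VFP_FAILURE state name is taken from the text. The sketch
shows the partition of a group into the members that fulfil the condition
and those that do not.

```python
def apply_group_rule(states, kill, warn):
    """Kill the group members that fulfil the condition, warn the rest.

    states maps a thread id to a (faulty, state) pair. A member fulfils
    the condition if it is flagged faulty or its state is "VFP_FAILURE".
    Returns False, doing nothing, when no member fulfils the condition.
    """
    matching = [t for t, (faulty, state) in sorted(states.items())
                if faulty or state == "VFP_FAILURE"]
    if not matching:
        return False
    others = [t for t in sorted(states) if t not in matching]
    for t in matching:
        kill(t)
    for t in others:
        warn(t)
    return True


# Hypothetical group 1 with four threads; "OK" is an assumed state name.
group1 = {
    1: (False, "OK"),
    2: (True, "OK"),
    3: (False, "VFP_FAILURE"),
    4: (False, "OK"),
}
killed, warned = [], []
applied = apply_group_rule(group1, killed.append, warned.append)
# killed is now [2, 3] and warned is [1, 4]
```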
As previously claimed, the two rules above are different fault tolerance
strategies that can easily be applied to the same user application. Any
other strategy is up to the user and requires no further modification of
the application. Furthermore, the only kind of instrumentation required
does not pertain to a particular strategy; it is only meant to register
some entities of the user application and to configure some basic error
detection tools, such as watchdog timers and trap handlers. A handful of
these tools have been made available within the EFTOS framework.
References
[1] B. Randell, ``System Structure for Software Fault Tolerance,''
IEEE TSE, vol. SE-1, pp. 220-232, June 1975.
[2] J. H. Saltzer, D. P. Reed, and D. D. Clark, ``End-to-end arguments
in system design,'' ACM Trans. on Comp. Systems, 2(4): 277-288, 1984.
[3] D. P. Siewiorek and R. S. Swarz, ``Reliable Computer Systems Design
and Implementation,'' Digital Press, 1992.
[4] A. Avizienis, ``The N-version approach to Fault-Tolerant Software,''
IEEE TSE, vol. SE-11, pp. 1491-1501, Dec. 1985.
[5] Y. Huang and C. M. R. Kintala, ``Software Implemented Fault
Tolerance: Technologies and Experience,'' Proc. of FTCS-23, Toulouse,
France, pp. 2-9, June 1993.
[6] G. Deconinck et al., ``Industrial Embedded HPC Applications,''
Int. Journal Supercomputer (ASFRA BV, Edam, The Netherlands) 69
(Vol. XIII, No. 3/4), 1997, pp. 23-44.
Author contact: Katholieke Universiteit Leuven, Electrical Engineering Dept.,
ACCA group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium. Phone: +32-16-32-1142.
Fax: +32-16-32-1986. E-Mail: vincenzo.deflorio@esat.kuleuven.ac.be.
Author contact: Katholieke Universiteit Leuven, Electrical Engineering Dept.,
ACCA group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium. Phone: +32-16-32-1126.
Fax: +32-16-32-1813. E-Mail: geert.deconinck@esat.kuleuven.ac.be.
Author contact: Katholieke Universiteit Leuven, Electrical Engineering Dept.,
ACCA group, Kard. Mercierlaan 94, B-3001 Heverlee, Belgium. Phone: +32-16-32-1035.
Fax: +32-16-32-1813. E-Mail: rudy.lauwereins@esat.kuleuven.ac.be.