Micro-Kernel support for Fault-Tolerant Application Development
on Distributed Systems
- C. Pérez,* G. Fabregat,** R. J. Martínez & G. Martín***
- *Dept. of Informatics and Electronics, University of Valencia
- **Dept. of Informatics, Jaume I University
- ***Robotics Institute, University of Valencia
This abstract describes work in progress aimed at facilitating the
use of software fault-tolerance (FT) techniques to improve the dependability
of distributed applications developed for systems where every node runs
an instance of a micro-kernel (MK), and user-level processes communicate
through message passing.
Because of the constraints under which applications are usually developed,
much past research has been directed at developing transparent
FT mechanisms. However, the wide range of applications and scenarios in
which distributed systems are used makes it very difficult, if not impossible,
to obtain a good compromise among transparency, dependability, performance
and cost. Instead, developers should be provided with a set of tools that
let them incorporate into each application the FT techniques that are
most appropriate to fulfil its particular requirements. These tools must
of course be efficient, but also versatile and complete enough
to implement a wide range of FT strategies. In addition, they should
be "easy to use", so that even developers under time or economic
constraints can benefit from them.
In order to be able to combine ease of use with efficiency and versatility,
and after identifying the common needs shared by the different FT techniques,
we decided to separate the task of building the set of tools to be offered
to application developers into two steps.
- First, incorporate into the MK a reduced but complete set of basic services
aimed at versatility and efficiency.
- Then develop (by means of servers and libraries using these basic services)
"easy to use" tools appropriate for most application developers.
In this way, typical applications can offer improved dependability at
a reasonable development cost, while for specialized applications whose
requirements justify it, the developer is free to implement specific strategies
by invoking the MK services directly.
Below, we summarize our conclusions about which functions should
be included in the MK and which must be implemented at higher system levels:
- The MK must provide services for process state saving, restoring and
updating. However, each process should be able to select which part of
its state to save, where and when.
- The MK must generate log information about messages exchanged by processes
and about interactions between processes and the MK, such as system calls
or software signals. However, the MK must not manage logs, which must be
created and maintained by user-level processes.
- Regarding IPC, the MK must provide support for efficient management
of intermediate message exchange at user level, because this is required
to implement fault-tolerant communication services and also to process
the results produced by distinct instances of replicated processes. Neither
fault-tolerant group communication nor comparison, voting or selection
of results should be included in the MK, because applications can benefit
from implementing tailored algorithms that exploit their specific characteristics,
as shown in [Eln93].
- Finally, the MK must detect its own errors
and those of the components it depends on (hardware, such as physical memory,
or software, such as a user-level swap server). It must also detect errors
in processes invoking the services it provides. Nevertheless, since error
detection mechanisms incur time overhead, each process should be allowed
to enable or disable them individually, so that it can trade error detection
against performance.
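As an illustration of the IPC point above, the following minimal sketch shows
how result selection for replicated processes might be written at user level,
outside the MK. The function names (majority_vote, median_select) and the
choice of Python are assumptions made for this example only; they are not part
of the MK interface described here.

```python
import statistics
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    # A strict majority requires more than half of the replicas to agree.
    return value if count > len(results) // 2 else None


def median_select(results: Sequence[float]) -> float:
    """Tailored selector for numeric results: the median masks outliers."""
    return statistics.median(results)
```

A generic application could use majority_vote, while one whose replicas
return slightly different numeric results could substitute median_select.
Keeping this choice at user level is what allows such tailoring.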
Once the requirements were established, we specified the set
of services that must be provided by the MK. The most challenging aspect
of this work has been to design services which let user-level processes
combine the advantages provided by tailored checkpoint tools, such as those
described in [HK93], with those offered by kernel-level services, mainly
incremental and copy-on-write checkpointing. We have called these new services
"Incremental Messages (IMs)", since they support selective
and incremental state saving, updating and restoration through a message
passing interface.
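The idea of selective and incremental state saving through messages can be
sketched as follows. This is a simplified illustration, not the IM interface
itself: the class TrackedState, the function apply_messages and the notion of
named state regions are all hypothetical, and real kernel-level support would
track modified memory pages rather than explicit writes.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[str, bytes]  # (region name, region contents)


class TrackedState:
    """Process state split into named regions, with dirty tracking."""

    def __init__(self) -> None:
        self._regions: Dict[str, bytes] = {}
        self._dirty: Set[str] = set()

    def write(self, region: str, data: bytes) -> None:
        """Update a region and remember that it changed."""
        self._regions[region] = data
        self._dirty.add(region)

    def incremental_messages(self, selected: Set[str]) -> List[Message]:
        """Emit one message per selected region changed since the last call."""
        to_send = sorted(self._dirty & selected)
        self._dirty -= set(to_send)
        return [(name, self._regions[name]) for name in to_send]


def apply_messages(checkpoint: Dict[str, bytes], messages: List[Message]) -> None:
    """Update a saved checkpoint in place from a batch of messages."""
    for name, data in messages:
        checkpoint[name] = data
```

In this sketch the process decides which regions to save (the selected
argument) and when (each call), and only the regions modified since the
previous call are sent, which is the incremental part.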
We then checked that the MK services meet the original objectives
of completeness, versatility and efficiency by specifying how to implement
a representative set of FT strategies that use them. This set includes fault-tolerant
services through server groups [CDD90], library checkpoint facilities such as
those described in [HK93], the main protocols of Manetho [Eln93], and a
transactional file service [Per98].
We have undertaken the implementation of the MK services on VSTa [Val],
starting with the IMs. We have used IMs to successfully implement library
checkpoint services as well as a transactional file service. Initial results
suggest that our approach is indeed adequate for building an FT application
development suite which provides both "easy to use" FT services
and versatile and efficient ones, all of them based on the same set
of MK mechanisms.
Currently, we are developing a communication
server to allow remote IM exchange in VSTa 1.5.1, in order to complete the
performance and dependability study of IMs. We are also working on
completing the implementation of the MK interface, so that other people
can undertake research on how to structure the user level.
References
[CDD90] Cristian, F.; Dancey, B. & Dehn, J. "Fault-Tolerance in the Advanced
Automation System." Proceedings of the 20th Fault-Tolerant Computing Symposium,
pp. 6-17. June 1990.
[Eln93] Elnozahy, E. N. "Manetho: Fault-Tolerance in Distributed
Systems Using Rollback-Recovery and Process Replication." PhD Thesis,
Rice University. October 1993.
[HK93] Huang, Y. & Kintala, C. "Software Implemented Fault Tolerance:
Technologies and Experience." Proceedings of the 23rd
Fault-Tolerant Computing Symposium, pp. 2-9. June 1993.
[Per98] Pérez, C. "Aportaciones a los Entornos de Desarrollo
de Aplicaciones Tolerantes a Fallos." PhD Thesis, University of
Valencia. February 1998.
[PF97] Pérez, C. "Los mensajes incrementales: una herramienta
para el desarrollo de aplicaciones tolerantes a fallos." Actas de
las VIII Jornadas de Paralelismo. September 1997.
[Val] Valencia, A. "An Overview of the VSTa Microkernel." http://bodhi.zendo.com/vsta/vsta_intro.html
Author contact: Departamento de Informatica y Electronica, Universidad de Valencia,
C/ Doctor Moliner 50, Burjassot, Valencia, Spain, E-46100. Phone: 96-316-0412.
Fax: 96-316-0418. E-Mail: carlos.perez@uv.es.