Chameleon: A Software Infrastructure for Adaptive Fault Tolerance
S. Bagchi, K. Whisnant, Z. Kalbarczyk, R.K. Iyer
Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801
E-mail: [bagchi, kwhisnan, kalbar, iyer]@crhc.uiuc.edu
1 Introduction
In networked computing
systems, a broad range of commercial and scientific applications that need
varying degrees of availability must coexist. It is not cost-effective to
develop a reliable platform in each case. It is more efficient to build
an infrastructure that provides the required level of dependability for
each application's needs. It is also essential that the proposed alternatives
should leverage off-the-shelf components. There have been exhaustive studies
on fault tolerance strategies capable of providing efficient mechanisms
to deal with system operational failures. Most of this work has focused
on specific application needs and thus provided only piecemeal solutions.
Little work has been done in addressing how to build a reliable networked
computing system out of unreliable computation nodes. As a result, there
is no comprehensive solution for providing a wide range of fault-tolerant
services in a single networked environment. The most feasible way of understanding
how such a software environment would fit on top of existing layers (the
operating system, the network interfaces, etc.) is to implement an infrastructure
for providing a range of reliable services. Fundamental components of the
envisioned infrastructure (Chameleon) have been designed so that none of
them is a single point of failure. Each of the components is active for
a certain period, e.g., while the system configuration is being set up.
If a component fails during its active phase, there is a provision for recovery,
either by switching to a backup or by regenerating the component.
2 Related Research
Current approaches to reliable
networked computing are based mainly on exploiting distributed groups of
cooperating processes. A number of studies (e.g., [1]) address various aspects
of this paradigm (such as process synchronization, job distribution, and fault
tolerance strategies). These studies have yielded a number of tools to support
the construction of reliable services, such as Horus [5]. Most of these
approaches require a specialized and complex software layer that must be
installed in each computation node. A primary objective of developing these
systems is to provide a software environment for constructing and executing
distributed applications. Some aspects of service availability are addressed
in the Piranha tool [3], which exploits the dynamic replication of objects
for achieving high availability. More recently, Microsoft's "Wolfpack"
clustering technology provides clustering extensions to Windows NT for
improving service availability and system scalability [6].
3 Chameleon Infrastructure
Chameleon (see Figure 1)
provides an adaptive infrastructure that supports different levels of availability
requirements simultaneously in a single, heterogeneous, clustered environment.
Fundamental components of Chameleon [2] include: (1)
a Fault Tolerance Manager (FTM) acting as an independent and intelligent
entity capable of identifying and establishing the required fault tolerance
strategy for executing the user application, (2) Reliable, Mobile and Intelligent
ARMORs capable of migrating through the network and operating autonomously
on behalf of the FTM according to built-in specifications and instructions,
(3) a Surrogate Manager operating as a pseudomanager for one particular
application and capable of interacting with the user and supporting proper
communications with the ARMORs monitoring the application execution on remote
hosts, (4) Host Daemons residing on each host and responsible for handshaking
with the ARMORs and monitoring the ARMORs' behavior, and (5) Software Libraries
providing basic building blocks to create or re-engineer ARMORs.
In the initialization phase, the FTM collects information about the system
configuration and characteristics of individual nodes. Initialization ARMORs
are sent to hosts to obtain this data and to install the host daemons on
participating machines. After successful initialization, Chameleon is ready
to accept user requests. When a request arrives, the FTM designates a query
ARMOR to acquire the necessary information on the application specifics,
such as the required availability level, needed system resources, types
of results, etc. Based on this information, the FTM identifies the necessary
fault tolerance strategy and designates a set of ARMORs to initiate and monitor
the application. Creation of ARMORs is performed according to a predefined
procedure that utilizes two software libraries: (1) a library of building
blocks and (2) a library of ARMORs.
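The request-handling step described above can be illustrated with a minimal
Python sketch. It is not the Chameleon implementation; the availability
levels, strategy names, ARMOR names, and the helpers AppRequest, STRATEGIES,
and select_strategy are all hypothetical, chosen only to show how an FTM
might map the information gathered by a query ARMOR to a fault tolerance
strategy and a set of ARMORs.

```python
from dataclasses import dataclass

# Hypothetical mapping from availability level to (strategy, ARMORs needed).
# The paper does not enumerate the concrete strategies Chameleon supports.
STRATEGIES = {
    "low": ("restart_on_failure", ["execution_armor"]),
    "medium": ("primary_backup", ["execution_armor", "backup_armor"]),
    "high": ("triple_replication",
             ["execution_armor", "execution_armor", "execution_armor"]),
}


@dataclass
class AppRequest:
    """Information a query ARMOR might collect about an application."""
    name: str
    availability: str  # assumed to be one of the STRATEGIES keys


def select_strategy(request, free_nodes):
    """Return (strategy, [(armor, node), ...]) for a user request.

    Raises ValueError if the availability level is unknown or if there
    are fewer free nodes than ARMORs required by the strategy.
    """
    if request.availability not in STRATEGIES:
        raise ValueError(f"unknown availability level: {request.availability}")
    strategy, armors = STRATEGIES[request.availability]
    if len(free_nodes) < len(armors):
        raise ValueError("not enough free nodes for the chosen strategy")
    # Place each ARMOR on its own node so that a single node failure
    # affects at most one ARMOR (an assumption of this sketch).
    assignment = list(zip(armors, free_nodes))
    return strategy, assignment
```

For example, a request with availability "medium" and three free nodes would
be given the hypothetical primary_backup strategy, with one ARMOR placed on
each of the first two nodes.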
ARMORs designated to support the application execution
migrate through the network to the selected nodes, install themselves on
the machines and initiate the application execution. In addition, a Surrogate
Manager is spawned by the FTM and associated with the application. It maintains
a copy of the system information held by the FTM, provides reliable communications
with the user, and supervises the ARMORs that monitor the application execution
on the remote hosts.
To ensure a rapid reaction to application failures,
the application is watched by the ARMOR that installed it at the remote
host and started execution. The ARMOR communicates detectable application
misbehavior to the appropriate Surrogate Manager. Because the ARMOR itself
may fail, it is watched by the Host Daemon, which is capable of notifying
the Surrogate Manager about ARMOR failures. The Surrogate Manager can regenerate
a new ARMOR either to complete or to restart the application. Once generated,
ARMORs and Surrogate Managers can act autonomously, and the FTM is free
to serve other user requests. Because the Surrogate Managers are capable
of performing the basic functions of the FTM, the application may complete
even in the presence of FTM failure. Errors and failures of the Surrogate
Manager are directly reported to the FTM by the local Host Daemon monitoring
the Surrogate Manager. The FTM then re-initializes the application execution
just as it would handle a new request. In order to detect node failures,
the FTM uses heartbeat messages, which are sent with a predefined frequency.
In the case of a node failure, the application(s) executed on the node are
migrated to other available nodes. To operate reliably, the FTM must be
resilient to errors. A possible solution is to maintain a passive backup
FTM, which holds a copy of the system information and is updated each time the
system state changes.
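The heartbeat-based detection of node failures can be sketched as follows.
This is an illustration under stated assumptions, not the Chameleon code:
the class HeartbeatMonitor, the timeout rule (a node is declared failed
after a fixed period of silence), and the least-loaded migration policy are
all hypothetical. Timestamps are passed in explicitly so that the behavior
is deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical FTM-side bookkeeping for node heartbeats."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence before a node is declared failed
        self.last_seen = {}     # node -> timestamp of its most recent heartbeat
        self.placement = {}     # application -> node it currently runs on

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the sorted list of nodes silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout)

    def migrate_from_failed(self, now):
        """Reassign applications on failed nodes; return {app: new_node}."""
        failed = set(self.failed_nodes(now))
        alive = sorted(n for n in self.last_seen if n not in failed)
        moved = {}
        for app, node in sorted(self.placement.items()):
            if node not in failed:
                continue
            if not alive:
                raise RuntimeError("no live node available for migration")
            # Assumed policy: choose the live node hosting the fewest applications.
            load = {n: 0 for n in alive}
            for placed in self.placement.values():
                if placed in load:
                    load[placed] += 1
            target = min(alive, key=lambda n: (load[n], n))
            self.placement[app] = target
            moved[app] = target
        return moved
```

In this sketch the FTM would call heartbeat() whenever a message arrives and
call migrate_from_failed() periodically; restarting the application on the
new node (through an ARMOR, in Chameleon's terms) is outside its scope.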
The Chameleon implementation
is based on widely available scripting languages, such as TCL, and high-level
programming languages, such as C++. The aim is to provide a relatively thin
software layer that must be present in each machine in the environment.
The Chameleon environment is not based on CORBA [4], the Object Management
Group's standard for building distributed systems. To build Chameleon around
CORBA would necessitate a full CORBA implementation on all the nodes in
the system. While vendors are increasingly providing applications conforming
to CORBA specifications, the vast majority of the existing applications
were not built around CORBA objects. Hence, in an environment like Chameleon,
which addresses availability needs in a general system of networked workstations
(as opposed to specialized systems built with CORBA objects, for example),
it was not considered desirable to impose the overhead of CORBA object handling.
4 Concluding Remarks
To demonstrate the capabilities
of Chameleon, the environment has been implemented on a small LAN of heterogeneous
machines (two Sun workstations running Solaris and two Intel PCs running Linux)
connected to a high-speed Myrinet switch. The prototype implementation
shows the feasibility of the approach. The ARMORs provide a useful abstraction
for migration of execution to remote platforms for supporting the user's
needs. At the same time, the environment code is thin enough to be integrated
as a layer on top of existing operating systems. Substantial future work
will be directed towards setting up the general ARMOR framework to make
it more flexible to application requirements. A graphical interface needs
to be incorporated into the environment to enable the user to monitor the
application as well as interact with the FTM.
Acknowledgment
This research was supported
by NASA under grant NAG 1-613, in cooperation with the Illinois Computer
Laboratory for Aerospace Systems and Software.
References
[1] Dolev D., D. Malki, "The Transis Approach to High Availability
Cluster Communication," Comm. of the ACM, Vol. 39, No. 4, 1996.
[2] Iyer R.K., Z. Kalbarczyk, S. Bagchi, "Chameleon: A Software
Infrastructure and Testbed for Reliable High-Speed Network Computing,"
Report CRHC-97-13, University of Illinois at Urbana-Champaign, July 1997.
[3] Maffeis S., "Piranha: A CORBA Tool for High Availability,"
IEEE Computer, Vol. 30, No. 4, April 1997.
[4] Object Management Group, The Common Object Request Broker:
Architecture and Specification (CORBA) Revision 2.0, 1995.
[5] van Renesse R., K.P. Birman, S. Maffeis, "Horus: A Flexible
Group Communication System," Comm. of the ACM, Vol. 39, No.
4, 1996.
[6] Microsoft Clustering Architecture "Wolfpack," White Paper,
May 1997, http://www.microsoft.com/ntserver/info/wolfpack.htm