Chameleon: A Software Infrastructure for Adaptive Fault Tolerance

S. Bagchi, K. Whisnant, Z. Kalbarczyk, R. K. Iyer
Center for Reliable and High-Performance Computing
University of Illinois at Urbana-Champaign
1308 W. Main St., Urbana, IL 61801
E-mail: [bagchi, kwhisnan, kalbar, iyer]@crhc.uiuc.edu

1 Introduction

In networked computing systems, a broad range of commercial and scientific applications that need varying degrees of availability must coexist. It is not cost-effective to develop a dedicated reliable platform for each case. It is more efficient to build an infrastructure that provides the level of dependability each application requires. It is also essential that such an infrastructure leverage off-the-shelf components. There have been extensive studies on fault tolerance strategies that provide efficient mechanisms for dealing with operational failures. Most of this work has focused on specific application needs and has thus provided only piecemeal solutions. Little work has addressed how to build a reliable networked computing system out of unreliable computation nodes. As a result, there is no comprehensive solution for providing a wide range of fault-tolerant services in a single networked environment. The most feasible way to understand how such a software environment would fit on top of existing layers (the operating system, the network interfaces, etc.) is to implement an infrastructure that provides a range of reliable services. The fundamental components of the envisioned infrastructure (Chameleon) have been designed so that none of them is a single point of failure. Each component is active for a certain period, e.g., while the system configuration is being set up. If a component fails during its active phase, there is a provision for recovery, either by switching to a backup or by regenerating the component.

2 Related Research

Current approaches to reliable networked computing are based mainly on exploiting distributed groups of cooperating processes. A number of studies, e.g., [1], address various aspects of this paradigm (such as process synchronization, job distribution, and fault tolerance strategies). These studies have yielded a number of tools to support the construction of reliable services, such as Horus [5]. Most of these approaches require a specialized and complex software layer that must be installed on each computation node. A primary objective of these systems is to provide a software environment for constructing and executing distributed applications. Some aspects of service availability are addressed in the Piranha tool [3], which exploits the dynamic replication of objects to achieve high availability. More recently, Microsoft's "Wolfpack" clustering technology provides clustering extensions to Windows NT for improving service availability and system scalability [6].

3 Chameleon Infrastructure

Chameleon (see Figure 1) provides an adaptive infrastructure that supports different levels of availability requirements simultaneously in a single, heterogeneous, clustered environment.

Fundamental components of Chameleon [2] include: (1) a Fault Tolerance Manager (FTM) acting as an independent and intelligent entity capable of identifying and establishing the required fault tolerance strategy for executing the user application, (2) Reliable, Mobile and Intelligent ARMORs capable of migrating through the network and operating autonomously on behalf of the FTM according to built-in specifications and instructions, (3) a Surrogate Manager operating as a pseudomanager for one particular application and capable of interacting with the user and supporting proper communications with the ARMORs monitoring the application execution on remote hosts, (4) Host Daemons residing on each host and responsible for handshaking with the ARMORs and monitoring the ARMORs' behavior, and (5) Software Libraries providing basic building blocks to create or re-engineer ARMORs.

In the initialization phase, the FTM collects information about the system configuration and the characteristics of individual nodes. Initialization ARMORs are sent to hosts to obtain this data and to install the Host Daemons on participating machines. After successful initialization, Chameleon is ready to accept user requests. When a request arrives, the FTM designates a query ARMOR to acquire the necessary information about the application, such as the required availability level, the system resources needed, and the types of results. Based on this information, the FTM identifies the necessary fault tolerance strategy and designates a set of ARMORs to initiate and monitor the application. ARMORs are created according to a predefined procedure that uses two software libraries: (1) a library of building blocks and (2) a library of ARMORs.
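The request-handling step can be illustrated with a minimal sketch. It assumes a lookup table from availability level to strategy; the level names, replica counts, building-block names, and the function designate_armors are all hypothetical and are not taken from the Chameleon implementation.

```python
from dataclasses import dataclass

# Hypothetical strategy table: the levels, replica counts, and building-block
# names are illustrative assumptions, not part of Chameleon itself.
STRATEGIES = {
    "low": {"replicas": 1, "blocks": ["launcher", "monitor"]},
    "medium": {"replicas": 2, "blocks": ["launcher", "monitor", "checkpoint"]},
    "high": {"replicas": 3, "blocks": ["launcher", "monitor", "voter"]},
}


@dataclass
class Request:
    application: str
    availability: str  # level reported back by the query ARMOR


@dataclass
class Armor:
    application: str
    host: str
    blocks: list  # building blocks this ARMOR is composed of


def designate_armors(request, hosts):
    """Select a strategy for the request and compose one ARMOR per chosen host."""
    if request.availability not in STRATEGIES:
        raise ValueError(f"unknown availability level: {request.availability}")
    strategy = STRATEGIES[request.availability]
    if len(hosts) < strategy["replicas"]:
        raise RuntimeError("not enough hosts for the selected strategy")
    selected = hosts[: strategy["replicas"]]
    # Each ARMOR gets its own copy of the building-block list.
    return [Armor(request.application, h, list(strategy["blocks"])) for h in selected]
```

For example, a request for the "high" level over four known hosts would yield three ARMORs in this sketch, one for each of the first three hosts. A real FTM would presumably also weigh the resource needs reported by the query ARMOR when choosing hosts; that is omitted here.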

ARMORs designated to support the application execution migrate through the network to the selected nodes, install themselves on those machines, and initiate the application execution. In addition, a Surrogate Manager is spawned by the FTM and associated with the application. It maintains a copy of the system information held by the FTM, provides reliable communications with the user, and supervises the ARMORs that monitor the application execution on the remote hosts.

To ensure a rapid reaction to application failures, the application is watched by the ARMOR that installed it on the remote host and started its execution. The ARMOR communicates detectable application misbehavior to the appropriate Surrogate Manager. Because the ARMOR itself may fail, it is watched by the Host Daemon, which is capable of notifying the Surrogate Manager about ARMOR failures. The Surrogate Manager can generate a new ARMOR either to complete or to restart the application. Once generated, ARMORs and Surrogate Managers can act autonomously, and the FTM is free to serve other user requests. Because the Surrogate Managers are capable of performing the basic functions of the FTM, the application may complete even in the presence of an FTM failure. Errors and failures of the Surrogate Manager are reported directly to the FTM by the local Host Daemon monitoring the Surrogate Manager. The FTM then reinitializes the application execution just as it would handle a new request. To detect node failures, the FTM uses heartbeat messages, which are sent at a predefined frequency. In the case of a node failure, the applications executing on the node are migrated to other available nodes. To operate reliably, the FTM itself must be resilient to errors. A possible solution is to support a passive backup FTM, which maintains the system information and is updated each time the system state changes.
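The heartbeat-based node failure detection and the subsequent migration can be sketched as follows. This is only an illustration under stated assumptions: the class HeartbeatMonitor, its method names, the timeout rule, and the round-robin choice of target node are hypothetical, since the text does not specify how the FTM selects a new node.

```python
class HeartbeatMonitor:
    """Hypothetical FTM-side bookkeeping for heartbeats and application placement."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence after which a node is presumed failed
        self.last_seen = {}     # host -> time of the most recent heartbeat
        self.placement = {}     # application -> host it currently runs on

    def heartbeat(self, host, now):
        """Record a heartbeat received from a host at time `now`."""
        self.last_seen[host] = now

    def place(self, application, host):
        """Record that an application is running on a host."""
        self.placement[application] = host

    def failed_nodes(self, now):
        """Return the hosts whose last heartbeat is older than the timeout."""
        return sorted(h for h, t in self.last_seen.items() if now - t > self.timeout)

    def migrate_from_failed(self, now):
        """Reassign applications on failed hosts to live hosts (round-robin, an assumption)."""
        failed = set(self.failed_nodes(now))
        alive = [h for h in sorted(self.last_seen) if h not in failed]
        moves = {}
        for application in sorted(self.placement):
            if self.placement[application] in failed:
                if not alive:
                    raise RuntimeError("no live node available for migration")
                target = alive[len(moves) % len(alive)]
                self.placement[application] = target
                moves[application] = target
        return moves
```

Time is passed in explicitly rather than read from a clock, which keeps the sketch deterministic. In a running system the migration step would correspond to dispatching new ARMORs to the target nodes rather than merely updating a table.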

The Chameleon implementation is based on widely available scripting languages, such as Tcl, and high-level programming languages, such as C++. The aim is to keep the software layer that must be present on each machine in the environment relatively thin. The Chameleon environment is not based on CORBA [4], the Object Management Group's standard for building distributed systems. Building Chameleon around CORBA would require a full CORBA implementation on every node in the system. While vendors are increasingly providing applications that conform to the CORBA specifications, the vast majority of existing applications were not built around CORBA objects. Hence, in an environment like Chameleon, which addresses availability needs in a general system of networked workstations (as opposed to specialized systems built with CORBA objects, for example), it was not considered desirable to impose the overhead of CORBA object handling.

4 Concluding Remarks

To demonstrate the capabilities of Chameleon, the environment has been implemented on a small LAN of heterogeneous machines (two Sun workstations running Solaris and two Intel PCs running Linux) connected through a high-speed Myrinet switch. The prototype implementation shows the feasibility of the approach. The ARMORs provide a useful abstraction for migrating execution to remote platforms in support of the user's needs. At the same time, the environment code is thin enough to be integrated as a layer on top of existing operating systems. Substantial future work will be directed towards generalizing the ARMOR framework so that it adapts more flexibly to application requirements. A graphical interface also needs to be incorporated into the environment to enable the user to monitor the application and to interact with the FTM.

Acknowledgment

This research was supported by NASA under grant NAG 1-613, in cooperation with the Illinois Computer Laboratory for Aerospace Systems and Software.

References

[1] Dolev D., D. Malki, "The Transis Approach to High Availability Cluster Communication," Comm. of the ACM, Vol. 39, No. 4, 1996.

[2] Iyer R.K., Z. Kalbarczyk, S. Bagchi, "Chameleon: A Software Infrastructure and Testbed for Reliable High-Speed Network Computing," Report CRHC-97-13, University of Illinois at Urbana-Champaign, July 1997.

[3] Maffeis S., "Piranha: A CORBA Tool for High Availability," IEEE Computer, Vol. 30, No. 4, April 1997.

[4] Object Management Group, The Common Object Request Broker: Architecture and Specification (CORBA) Revision 2.0, 1995.

[5] van Renesse R., K.P. Birman, S. Maffeis, "Horus: A Flexible Group Communication System," Comm. of the ACM, Vol. 39, No. 4, 1996.

[6] Microsoft Clustering Architecture "Wolfpack," White Paper, May 1997, http://www.microsoft.com/ntserver/info/wolfpack.htm