Integration of fault tolerance techniques: a system of patterns to cope with hardware, software and environmental fault tolerance.

 

Luciane Lamour Ferreira - Cecília M. F. Rubira
Institute of Computing - State University of Campinas
P.O. Box 6176, Campinas, SP 13083-970
e-mail: {972311, cmrubira}@dcc.unicamp.br

 

Techniques for achieving fault tolerance depend upon the effective deployment and utilization of redundancy[LA90]. The incorporation of redundancy in a software system requires a structured and disciplined approach; otherwise it may increase the complexity of the system and consequently decrease, rather than increase, the system’s robustness. Ideally, one should consider the integration of hardware, software and environmental fault tolerance to cope with the various kinds of faults that can appear in a software system. Hardware fault tolerance[LA90] applies object replication to enhance system availability and reliability in the presence of hardware faults; software fault tolerance[LA90] applies software redundancy by means of design diversity to tolerate software faults introduced during the design, programming or maintenance phases of the software development cycle; and environmental fault tolerance[Rub94] copes with faults that can occur in real-world entities in the problem domain, applying redundancy to represent the different abnormal behavior phases that the corresponding objects in the solution domain can present. In this paper, we present a system of design patterns[GHJV95] that provides a uniform solution to the incorporation of redundancy in an object-oriented fault-tolerant system. First, we describe a general pattern that defines a common structure for implementing software redundancy, applicable to all three kinds of fault tolerance. This pattern is called the Software Redundancy Metapattern[Pre95], because it is an abstract pattern that should be customized to implement a specific fault tolerance technique. Then, we show how this metapattern can be applied to implement environmental, software and hardware fault tolerance techniques in order to generate more concrete design patterns[GHJV95].

Metapattern: Software Redundancy Metapattern

The problem: In general, fault tolerance techniques use software redundancy for building fault-tolerant applications. It is desirable to standardize an object-oriented design solution to manage this redundancy, so that the same solution can be reused for the different kinds of fault tolerance. This solution should emphasize the separation of the control aspects that implement the fault tolerance mechanisms (i.e. the non-functional requirements) from the functional aspects of the application. Ideally, software redundancy and its control should be introduced transparently, preferably in a non-intrusive manner[BRL97].

The solution: We abstract the similarities of the redundancy control employed by environmental, software and hardware fault tolerance and define a meta-level state machine that implements a general control mechanism based on states, transitions and events. Using a meta-level architecture, we can divide a fault-tolerant application into two layers: the meta level and the base level. The state machine elements are defined at the meta level by means of metaobjects responsible for implementing the control aspects of the redundant components defined at the base level. These metaobjects are MetaStates, MetaTransitions and a MetaController (Figure 1). Events correspond to service requests to a fault-tolerant component at the base level, which are intercepted using a Metaobject Protocol (MOP) provided by the meta-level architecture. These events are materialized by means of metaobjects that perform the reflective computation related to the fault tolerance aspects.

MetaState: this class is responsible for creating and initializing the redundant components at the base level, and delegating the service execution to the redundant components. MetaState objects may also implement additional activities related to a specific fault tolerance technique, for instance, the N-Version technique to implement software fault tolerance.

MetaTransition: this class represents the control algorithm of the mechanism used to execute the redundant components. A MetaTransition object verifies whether an event is valid to cause a transition to the next state, checks the associated conditions, and can perform exception handling. The MetaTransition class keeps a reference to the next MetaState object that can be reached. This reference is passed to the MetaController object so that it can change the current state of the state machine.

MetaController: this class is responsible for intercepting the service requests for the fault-tolerant component, materializing them, and delegating them to the other metaobjects, which perform the reflective computation related to the fault tolerance technique. This class is also responsible for creating and configuring the state machine, and for maintaining the current MetaState object.
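
The roles of the three metaobjects can be illustrated with a minimal Python sketch. It is not the implementation behind the paper: Python offers no Metaobject Protocol in the sense used here, so interception is approximated with `__getattr__`, the state machine is configured by the caller rather than by the MetaController, and the names `Primary`, `Fallback`, `read`, `execute`, `fires` and `handle` are hypothetical.

```python
class MetaState:
    """Meta-level state that owns one redundant base-level component."""

    def __init__(self, name, component_factory):
        self.name = name
        self.component = component_factory()  # create and initialize the component
        self.transitions = []                 # outgoing MetaTransition objects

    def execute(self, service, *args):
        # Delegate the service execution to the base-level component.
        return getattr(self.component, service)(*args)


class MetaTransition:
    """Control rule that decides whether an event leads to next_state."""

    def __init__(self, next_state, condition):
        self.next_state = next_state  # the MetaState that can be reached
        self.condition = condition    # callable(service, error) -> bool

    def fires(self, service, error):
        return self.condition(service, error)


class MetaController:
    """Intercepts service requests and maintains the current MetaState."""

    def __init__(self, initial_state):
        self.current = initial_state

    def __getattr__(self, service):
        # Stand-in for MOP interception (an assumption of this sketch):
        # any unknown public attribute is treated as a service request.
        if service.startswith("_"):
            raise AttributeError(service)
        return lambda *args: self.handle(service, *args)

    def handle(self, service, *args):
        while True:
            try:
                return self.current.execute(service, *args)
            except Exception as error:
                target = next((t.next_state for t in self.current.transitions
                               if t.fires(service, error)), None)
                if target is None:
                    raise              # no transition accepts this event
                self.current = target  # change state, then retry the request


# Hypothetical base-level components offering the same service.
class Primary:
    def read(self, key):
        raise RuntimeError("primary unavailable")


class Fallback:
    def read(self, key):
        return f"cached:{key}"


normal = MetaState("normal", Primary)
degraded = MetaState("degraded", Fallback)
normal.transitions.append(
    MetaTransition(degraded, lambda service, error: isinstance(error, RuntimeError)))

component = MetaController(normal)
result = component.read("x")  # served by Fallback after one state transition
```

The caller only sees `component.read`; the choice of redundant component and the change of state stay at the meta level. A cyclic configuration in which every state fails would retry forever, so a practical version would bound the number of transitions per request.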

Variations of the proposed metapattern

Below, we discuss some variations of the metapattern for implementing environmental, software and hardware fault tolerance.

(1) In environmental fault tolerance[Rub94], redundant components correspond to state objects that encapsulate different service implementations, which represent the normal and abnormal behavior phases of these components. A state transition occurs when an exception signals that the component has changed from the normal to the abnormal behavior phase. To handle a state-dependent service, the current MetaState object should delegate it to the state object at the base level. The current MetaState object should also broadcast the event to its MetaTransition objects, so that they can verify whether the event causes a state transition.

(2) In software fault tolerance, redundant components correspond to the different versions of the fault-tolerant component services. These versions are encapsulated by objects at the base level. A MetaState object has a reference to a version object at the base level and delegates the execution of the services to it. The result of the service execution is returned to the MetaState object, which then passes it to the MetaTransition objects for handling. For example, a MetaTransition object can implement either the Acceptance Test of the Recovery-Block technique or the Voter of the N-Version technique.

(3) In hardware fault tolerance, redundancy is provided by object replication. For instance, if a primary copy fails, a secondary copy is executed to provide the same service. The redundant copies may be located on different computers in a distributed system, and the MetaState objects are responsible for implementing location transparency. A MetaState object has a reference to the remote object, and should initialize it with the current state of the system and control the execution of the services through the network. The MetaTransition objects are responsible for handling the runtime exceptions generated by a faulty copy and for activating a secondary copy.
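
The decision logic that the MetaTransition objects hold in variations (2) and (3) can be sketched as plain Python functions. This is a simplified illustration under stated assumptions: versions and replicas are modelled as side-effect-free callables, so the state restoration that a full Recovery Block performs before trying an alternate is omitted, the voter requires hashable results, and remote copies are simulated locally. All function names are hypothetical.

```python
import math
from collections import Counter


def recovery_block(versions, acceptance_test, *args):
    """Try versions in order; return the first result that passes the test."""
    for version in versions:
        try:
            result = version(*args)
        except Exception:
            continue  # a crashing version counts as a failed alternate
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("no version passed the acceptance test")


def n_version_vote(versions, *args):
    """Run every version; return the result given by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a crashing version simply casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count > len(versions) // 2:
            return value
    raise RuntimeError("no majority among the versions")


def failover(replicas, *args):
    """Call the primary copy first, then each secondary copy on failure."""
    last_error = None
    for replica in replicas:
        try:
            return replica(*args)
        except Exception as error:
            last_error = error  # faulty copy; activate the next one
    raise RuntimeError("all replicas failed") from last_error


# Hypothetical diverse versions of an integer square root service.
def isqrt_exact(n):
    return math.isqrt(n)

def isqrt_float(n):
    return int(n ** 0.5)

def isqrt_faulty(n):
    return n // 2  # deliberately wrong design

def crashed_copy(n):
    raise ConnectionError("node down")

def is_isqrt(result, n):
    return result * result <= n < (result + 1) ** 2


by_recovery_block = recovery_block([isqrt_faulty, isqrt_exact], is_isqrt, 17)
by_voting = n_version_vote([isqrt_exact, isqrt_faulty, isqrt_float], 17)
by_failover = failover([crashed_copy, isqrt_exact], 17)
```

In all three calls the faulty or crashed component is masked and the caller receives 4, the integer square root of 17. Variation (1) differs in that the redundant components are alternative behaviors selected by the current state rather than interchangeable implementations of one specification.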

These concrete design patterns can be combined to deal with hardware, software and environmental faults at the same time. One possible sequence for applying the patterns is the following. First, one can apply the environmental fault tolerance pattern to cope with environmental faults. Then, the software fault tolerance pattern can be applied to implement N versions of the state objects. Finally, the hardware fault tolerance pattern can be applied to replicate the redundant components. Other combinations of the patterns are possible to enhance the system’s reliability and availability, and the choice among them will depend on the system’s requirements.

 

References

[BRL97] L.E. Buzato, C.M.F. Rubira and M.L. Lisboa. A Reflective Object-Oriented Architecture for Developing Fault-Tolerant Software. Journal of the Brazilian Computer Society, 4(2):39-48, November 1997.

[GHJV95] E. Gamma, R. Helm, R. Johnson and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.

[LA90] A. Lee and T. Anderson. Fault Tolerance: Principles and Practice. Springer-Verlag, 1990.

[Pre95] W. Pree. Design Patterns for Object-Oriented Software Development. Addison-Wesley, 1995.

[Rub94] C.M.F. Rubira. Structuring Fault-Tolerant Object-Oriented Systems Using Inheritance and Delegation. PhD thesis, Dept. of Computing Science, University of Newcastle upon Tyne, October 1994.