Integration of fault tolerance techniques: a system of patterns
to cope with hardware, software and environmental fault tolerance.
- Luciane Lamour Ferreira - Cecília M. F. Rubira
- Institute of Computing - State University of Campinas
- P.O. Box 6176, Campinas, SP 13083-970
- e-mail: {972311, cmrubira}@dcc.unicamp.br
Techniques for achieving fault tolerance depend upon the effective deployment
and utilization of redundancy[LA90]. The
incorporation of redundancy in a software system requires a structured and
disciplined approach; otherwise it may increase the complexity of the system
and consequently decrease, rather than increase, the system's
robustness. Ideally, one should consider the integration of hardware, software
and environmental fault tolerance to cope with the various kinds of faults
that can appear in a software system. Hardware fault tolerance[LA90] applies object replication to enhance the system availability/reliability
in the presence of hardware faults; software fault tolerance[LA90]
applies software redundancy by means of diversity of design to tolerate
software faults that can be introduced during the design, programming or maintenance
phases of the software development cycle; and environmental fault tolerance[Rub94] copes with faults that can occur in real-world
entities in the problem domain and applies redundancy to represent the different
abnormal behavior phases that the corresponding objects in the solution
domain can present. In this paper, we present a system of design patterns[GHJV95] that provides a uniform solution to the incorporation
of redundancy in an object-oriented fault-tolerant system. First, we describe
a more general pattern that defines a common structure that can be applied
to the three kinds of fault tolerance to implement software redundancy.
This pattern is called Software Redundancy Metapattern[Pre95],
because it is an abstract pattern that should be customized to implement a specific
fault tolerance technique. Then, we show how this metapattern can be applied
to implement environmental, software and hardware fault tolerance techniques
in order to generate more concrete design patterns[GHJV95].
Metapattern: Software Redundancy Metapattern
The problem: In general, fault tolerance techniques use
software redundancy to build fault-tolerant applications. It is desirable
to standardize an object-oriented design solution for managing this redundancy,
so that the same solution can be reused for the different kinds of fault
tolerance. This solution should emphasize the separation of the control aspects
that implement the fault tolerance mechanisms (i.e. the non-functional
requirements) from the functional aspects of the application. Ideally, software
redundancy and its control should be introduced transparently, preferably
in a non-intrusive manner[BRL97].
The solution: We abstract the similarities of redundancy
control employed by environmental, software and hardware fault tolerance
and define a meta-level state machine that can implement a general control
mechanism based on states, transitions and events. Using a meta-level architecture,
we can divide a fault-tolerant application into two layers: the meta level
and the base level. The state machine elements are defined at the meta level
by means of metaobjects responsible for implementing the control aspects
of the redundant components defined at the base level. These metaobjects
can be MetaStates, MetaTransitions or a MetaController (Figure 1). Events
correspond to service requests to a fault-tolerant component at the base
level, which are intercepted using a Metaobject Protocol (MOP) provided by
the meta-level architecture. These events are materialized by means of metaobjects
that perform the reflective computation related to the fault tolerance aspects.
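The interception step can be illustrated with a minimal Python sketch. Python does not provide a MOP of the kind assumed by the paper, so the sketch emulates one with a proxy whose `__getattr__` materializes each service request as an event object. The names `Event`, `Intercepted` and `Sensor` are hypothetical and are not part of the original design.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    """A materialized (reified) service request."""
    service: str
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)


class Intercepted:
    """Hypothetical stand-in for a MOP: wraps a base-level object and
    turns every service request into an Event before executing it."""

    def __init__(self, base: Any):
        self._base = base
        self.log = []          # events observed at the meta level

    def __getattr__(self, name: str):
        # Invoked only for attributes that the proxy itself does not define.
        target = getattr(self._base, name)
        if not callable(target):
            return target

        def request(*args, **kwargs):
            event = Event(name, args, kwargs)
            self.log.append(event)                      # meta-level computation
            return target(*event.args, **event.kwargs)  # base-level execution

        return request


class Sensor:
    """Hypothetical base-level component."""

    def read(self, channel: int) -> int:
        return channel * 10


proxy = Intercepted(Sensor())
result = proxy.read(4)   # intercepted, logged, then delegated to Sensor.read
```

In this sketch the meta-level computation is only a log entry; in a fault-tolerant application it would be the place where the redundancy control is invoked.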
MetaState: this class is responsible for creating and initializing
the redundant components at the base level, and delegating the service execution
to the redundant components. MetaState objects may also implement additional
activities related to a specific fault tolerance technique, for instance,
the N-Version technique to implement software fault tolerance.
MetaTransition: this class represents the control algorithm of
the mechanism used to execute the redundant components. A MetaTransition
object verifies whether an event is valid to cause a transition to the next state,
checks the associated conditions, and can perform exception handling. The MetaTransition
class keeps a reference to the next MetaState object that can be reached.
This reference is passed to the MetaController object so that it can change
the current state of the state machine.
MetaController: this class is responsible for intercepting the
service requests for the fault-tolerant component, materializing them and delegating
them to the other metaobjects, which perform the reflective computation related to the fault
tolerance technique. This class is also responsible for creating and configuring
the state machine, and maintaining the current MetaState object.
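The three metaobject roles can be sketched as a small state machine. This is an illustrative Python sketch under stated assumptions: the method names `handle`, `fires` and `request`, the `guard` parameter, and the `Primary` and `Fallback` components are all hypothetical, and interception is reduced to an explicit call to `request`.

```python
class MetaState:
    """Creates its redundant base-level component and delegates services to it."""

    def __init__(self, name, component_factory):
        self.name = name
        self.component = component_factory()   # create and initialize the component
        self.transitions = []                  # outgoing MetaTransition objects

    def handle(self, service, *args):
        return getattr(self.component, service)(*args)


class MetaTransition:
    """Decides whether an event moves the machine to its next MetaState."""

    def __init__(self, trigger, next_state, guard=lambda outcome: True):
        self.trigger = trigger          # service name that may fire the transition
        self.next_state = next_state    # reference to the reachable MetaState
        self.guard = guard              # additional condition on the outcome

    def fires(self, service, outcome):
        return service == self.trigger and self.guard(outcome)


class MetaController:
    """Receives service requests, delegates them and keeps the current state."""

    def __init__(self, initial_state):
        self.current = initial_state

    def request(self, service, *args):
        outcome = self.current.handle(service, *args)
        for transition in self.current.transitions:
            if transition.fires(service, outcome):
                self.current = transition.next_state   # change the current state
                break
        return outcome


class Primary:
    def compute(self, x):
        return x * 2 if x >= 0 else None   # None stands for "no valid result"


class Fallback:
    def compute(self, x):
        return abs(x) * 2


primary = MetaState("primary", Primary)
fallback = MetaState("fallback", Fallback)
primary.transitions.append(
    MetaTransition("compute", fallback, guard=lambda outcome: outcome is None))

controller = MetaController(primary)
first = controller.request("compute", 3)     # handled by Primary
second = controller.request("compute", -3)   # Primary yields None; state changes
third = controller.request("compute", -3)    # now handled by Fallback
```

The MetaTransition only hands the controller a reference to the next MetaState; the controller alone changes the current state, which mirrors the division of responsibilities described above.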
Variations of the proposed metapattern
Below, we discuss some variations of the metapattern for implementing
environmental, software and hardware fault tolerance.
(1) In environmental fault tolerance[Rub94],
redundant components correspond to state objects that encapsulate different
service implementations, which represent the normal and abnormal behavior
phases of these components. A state transition occurs when an exception
signals that the component has changed from the normal to the abnormal behavior
phase. To handle a state-dependent service, the current MetaState object
should delegate it to the state object at the base level. The current MetaState
object should also broadcast the event to its MetaTransition objects,
so that they can verify whether the event causes a state transition.
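A minimal Python sketch of this variation follows. The pump example, the `Overheated` exception and all method names are hypothetical. Retrying the request once in the new phase is an assumption of the sketch; the text does not prescribe it.

```python
class Overheated(Exception):
    """Signals that the real-world entity entered an abnormal phase."""


class NormalPump:
    """State object for the normal behavior phase."""

    def flow_rate(self, temperature):
        if temperature > 90:
            raise Overheated(temperature)
        return 100


class DegradedPump:
    """State object for the abnormal (overheated) behavior phase."""

    def flow_rate(self, temperature):
        return 40    # reduced service while the pump is overheated


class MetaTransition:
    def __init__(self, exception_type, next_state):
        self.exception_type = exception_type
        self.next_state = next_state

    def check(self, exception):
        """Return the next MetaState if this exception triggers the transition."""
        if isinstance(exception, self.exception_type):
            return self.next_state
        return None


class MetaState:
    def __init__(self, state_object):
        self.state_object = state_object
        self.transitions = []

    def handle(self, service, *args):
        """Delegate to the state object; on an exception, broadcast it to
        every transition and report the next state, if any."""
        try:
            return getattr(self.state_object, service)(*args), None
        except Exception as exc:
            for transition in self.transitions:
                next_state = transition.check(exc)
                if next_state is not None:
                    return None, next_state
            raise   # no transition accepts the exception


class MetaController:
    def __init__(self, initial):
        self.current = initial

    def request(self, service, *args):
        result, next_state = self.current.handle(service, *args)
        if next_state is not None:
            self.current = next_state
            # Assumption of this sketch: retry once in the new phase.
            result, _ = self.current.handle(service, *args)
        return result


normal = MetaState(NormalPump())
degraded = MetaState(DegradedPump())
normal.transitions.append(MetaTransition(Overheated, degraded))

pump = MetaController(normal)
readings = [pump.request("flow_rate", t) for t in (70, 95, 70)]
```

The sketch defines no transition back to the normal phase, so the third request is still served by the degraded state object.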
(2) In software fault tolerance, redundant components correspond to the
different versions of the fault-tolerant component services. These versions
are encapsulated by objects at the base level. A MetaState object has a
reference to a version object at the base level and delegates the execution
of the services to it. The result of the service execution is returned to the
MetaState object, which then passes this result to the
MetaTransition objects that handle it. For example, a MetaTransition object
can implement either the Acceptance Test of the Recovery-Block technique
or the Voter of the N-Version technique.
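The two decision mechanisms can be contrasted in a short Python sketch. For brevity it uses plain functions where the pattern would use MetaTransition objects, and the three integer square root versions, one of them deliberately faulty, are hypothetical examples. The voter assumes that results are hashable and comparable for equality.

```python
from collections import Counter


def recovery_block(versions, acceptance_test, *args):
    """Try the versions in order; return the first result that passes the test."""
    for version in versions:
        try:
            result = version(*args)
        except Exception:
            continue                      # a crashing version counts as failed
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all versions failed the acceptance test")


def n_version(versions, *args):
    """Run every version and return the result backed by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            continue                      # a crashed version casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if 2 * votes > len(versions):     # strict majority of all versions
            return value
    raise RuntimeError("no majority among the versions")


def isqrt_newton(n):
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x


def isqrt_float(n):
    return int(n ** 0.5)


def isqrt_faulty(n):
    return n // 2                         # deliberately wrong design


def accept(result, n):
    return result * result <= n < (result + 1) ** 2


by_recovery_block = recovery_block([isqrt_faulty, isqrt_newton], accept, 17)
by_voting = n_version([isqrt_newton, isqrt_float, isqrt_faulty], 17)
```

The recovery block runs alternates only until one passes the acceptance test, whereas the voter always runs every version and compares the results.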
(3) In hardware fault tolerance, the redundancy is provided by object
replication. For instance, if a primary copy fails, a secondary copy will
be executed to provide the same service. The redundant copies may be located
in different computers in a distributed system, and the MetaState objects
are responsible for implementing location transparency. The MetaState
has a reference to the remote object, and should initialize it with the
current state of the system and control the execution of the services through
the network. The MetaTransitions are responsible for handling the runtime
exceptions generated by a faulty copy, and activating a secondary copy.
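The failover behavior can be sketched in Python as follows. The sketch simulates replicas in a single process instead of over a network, and `CounterReplica`, `ReplicaDown`, `restore` and the checkpoint field are hypothetical names. It also folds the MetaState and MetaTransition roles into one controller class for brevity.

```python
class ReplicaDown(Exception):
    """Simulated runtime exception raised by a faulty copy."""


class CounterReplica:
    """Hypothetical replicated object; `alive` simulates a node crash."""

    def __init__(self, name):
        self.name = name
        self.value = 0
        self.alive = True

    def restore(self, value):
        self.value = value

    def increment(self):
        if not self.alive:
            raise ReplicaDown(self.name)
        self.value += 1
        return self.value


class ReplicatedCounter:
    """Keeps track of the active copy and fails over to the next one."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.active = 0
        self.checkpoint = 0     # last state known to be good

    def increment(self):
        while self.active < len(self.replicas):
            replica = self.replicas[self.active]
            try:
                result = replica.increment()
                self.checkpoint = replica.value
                return result
            except ReplicaDown:
                # Transition: activate the next copy, initialized with
                # the last known state of the system.
                self.active += 1
                if self.active < len(self.replicas):
                    self.replicas[self.active].restore(self.checkpoint)
        raise RuntimeError("no replica available")


primary = CounterReplica("primary")
secondary = CounterReplica("secondary")
counter = ReplicatedCounter([primary, secondary])

before_crash = counter.increment()   # served by the primary copy
primary.alive = False                # simulate a hardware fault
after_crash = counter.increment()    # served by the secondary copy
```

The caller uses the same `increment` call before and after the simulated crash, which is the transparency that the MetaState objects are meant to provide.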
These concrete design patterns can be combined to deal with hardware,
software and environmental faults at the same time. One possible sequence
for applying the patterns is the following. First, one can apply the environmental
fault tolerance pattern to cope with environmental faults. Then, the software
fault tolerance pattern can be applied to implement N versions of the
state objects. Finally, the hardware fault tolerance pattern can be applied
to implement the replication of the redundant components. Other combinations
of the patterns are possible to enhance the system reliability/availability,
and the choice of combination will depend on the system's requirements.
References
[BRL97] L.E. Buzato, C.M.F. Rubira and M.L. Lisboa.
A Reflective Object-Oriented Architecture for Developing Fault-Tolerant
Software. Journal of the Brazilian Computer Society, 4(2):39-48, November
1997.
[GHJV95] E. Gamma, R. Helm, R. Johnson and J. Vlissides.
Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley,
1995.
[LA90] A. Lee and T. Anderson. Fault Tolerance:
Principles and Practice. Springer-Verlag, 1990.
[Pre95] W. Pree. Design Patterns for Object-Oriented
Software Development. Addison-Wesley, 1995.
[Rub94] C.M.F. Rubira. Structuring Fault-Tolerant
Object-Oriented Systems Using Inheritance and Delegation. PhD thesis,
Dept. of Computing Science, University of Newcastle upon Tyne, October
1994.