An efficient fault-detecting protocol for communicating duplex
systems
Peter Sobe 1
Institute for Computer Systems
Dresden University of Technology
Duplicated processes are used to detect arbitrary
faults of single processes. Several schemes including fault detetction and
treatment are based on duplicated processes (duplex systems). An often applied
approach is the combination with backward recovery. But also forward recovery
schemes [LoFA 90],[PrVa 94]
are suited to convert the state of the duplex system in a fault free state.
The considerations are focused on the usage of an efficient protocol for
the communication of the duplex systems. Also the aspect of the combination
with a variant of roll-forward recovery in reaction of a detected fault
is discussed.
When we use a communication protocol of m2-steps and no acknowledgements
in the fault free case, four messages per user communication step have to
be transmitted. The receiver processes compare the messages coming from
two sender replicas and detect if a message is influenced by a fault. A
reduction of the number of protocol messages can be reached by weakening
the fault-detection capability. This idea is the basis for a new class of
protocols, the communication step overlapping fault detection protocols.
This class is characterized by the weakness that faults are not detected
within the present communication step. First on a later communication step
the fault is detected. We put up with this weakness, because we need the
efficiency. A special realization is the even-odd protocol. The idea is
the result from the following thoughts. If every replica communicates only
with one other replica, so that replica 0 of the sender only sends a message
to replica 0 of the receiver, replicas 1 in the same manner, the protocol
is free of time overhead. Unfortunately this does not provide fault detection
because the system is splitted in two independent systems which interact
only when extra comparisons are done. To maintain fault detection by communication,
we switch between direct and crossing protocol messages alternatively. The
even-odd-protocol uses a counter for every sender-receiver pair s,r (comparable
to sequence numbers). With every communication of this pair the counter
c(s,r) is incremented. If c(s,r) is even the replicas communicates only
in direct manner, otherwise they communicate with crossing messages from
replica 0 to replica 1 and vice versa. Message i+1 contains the hash value
of message i as an extension. In this way, a replica fault which influences
communication step i is detected on communication step i+1.
We assume that the processes communicate more frequently than checkpoint
comparisons are done. Otherwise you may insert communications only for the
sake of fault detetction. An illustration of the m(even), m(odd) and m2
steps is given in Figure 1.
A problem arises by the fault propagation with communication sequences.
If these sequences form a cycle or connect two processes via different processes,
the faulty state can propagate to two replicas of a duplex systems. To prohibit
fault propagation, we use m2-steps if a cyclic sequence is closed
or paths between two application processen via different process pairs are
formed.
The even-odd protocol serves the following
properties:
- Detection of faults during communication
steps. A faulty or a missing message is detected at the next communication
of the same sender receiver pair.
- Efficiency. On average it uses less
messages than the use of m2-steps only. The most communication
patterns require three protocol communication steps per application message
on average. The exclusive usage of m2-steps would need four
protocol communications on average.
- Restricted Fault Propagation. The
insertion of m2-steps prohibits the propagation of single process
faults to two replicas of a process. As a requirement for roll-forward
recovery one replica state remains fault free.
- Applicability in combination with roll-forward
recovery. The system must be splitted in to halfs consistently and
the source of the fault has to be traced. This procedure is based on the
same control information which is used by the even-odd protocol.
Measurements were done on parallel applications. The overhead of process
duplication using the even-odd protocol was compared with the overhead using
a m2-protocol.
On the most examples the protocol can save a quarter of the protocol
messages on average compared to a m2-Protocol. The reduction
of the message number leads to a lower protcol overhead and also to a lower
load of the communication network. Here we present the results using a direct
solver for a system of linear equations as example. Figure 2 shows the protocol
overhead in percent measured with different problem sizes. As algorithm
the gauss-jordan elimination is used. The problem size appears as the number
of unknown variables. This application uses the even-odd protocol less beneficial,
because of some m-steps at the beginning. Therefore some m2-steps
were forced by the application. As result we obtained a more beneficial
selection of protocol steps for the following communications.
The combiantion with roll-forward recovery is based on the following
ideas:
- Backtracing to the duplex system which
caused the fault.When a fault is detected, negative acknowlegdements
are transfered to the sender. A retransmission of the recently sent messages
per m-steps is activated and versions are generated. This is done via the
entire path of a possible fault propagation.
- Decision if faulty locally. With
the retransmitted messages the decision is possible, if the fauly state
was caused by the duplex partner of the same application process or it
was the result of a fault propagation.
- During the program is progressing in
two isolated halfs, the fault is local identified by a retry phase starting
from a recently taken checkpoint. When the fault free version is identified,
every replica of the fault free half copy its state to its partner of the
faulty half.
- Validation of single replica phases.
This may be done later in parallel to the progress of the application
where every application process is a duplex system again.
The advantage to backward recovery is, that only the creation of additional
checkpoints delays the application progress. This is necessary to copy states
from a process to another process or to obtain the basis for a validation
run. The combination of the even-odd protocol and roll-forward recovery
is an optimistic fault tolerance scheme, because the overhead in the fault
free case is low, and the delay in occurence of single faults lower compared
to backward recovery. In less probable scenarios (multiple faults during
a relatively short period), a rollback is necessary.
References
[LoFA 90] J. Long, W.K.
Fuchs and J.A. Abraham,
A Forward Error Recovery Strategie using Checkpointing in Distributed Systems,
In: International Conference on Parallel Processing (Proceedings), August,1990
.
[PrVa 94] D.K. Pradhan
and N.H. Vaidya,
Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture,
IEEE Transactions on Computers 43(10), October1994.
Author
contact: Dresden University of Technology, Institute of Computer Systems,
D-01062 Dresden, Germany
Phone: +49 351 4638554. Fax: +49 351 4638245. E-Mail: ps1@irz.inf.tu-dresden.de |