An efficient fault-detecting protocol for communicating duplex systems

Peter Sobe 1
Institute for Computer Systems
Dresden University of Technology

Duplicated processes are used to detect arbitrary faults of single processes. Several schemes including fault detetction and treatment are based on duplicated processes (duplex systems). An often applied approach is the combination with backward recovery. But also forward recovery schemes [LoFA 90],[PrVa 94] are suited to convert the state of the duplex system in a fault free state. The considerations are focused on the usage of an efficient protocol for the communication of the duplex systems. Also the aspect of the combination with a variant of roll-forward recovery in reaction of a detected fault is discussed.

When we use a communication protocol of m2-steps and no acknowledgements in the fault free case, four messages per user communication step have to be transmitted. The receiver processes compare the messages coming from two sender replicas and detect if a message is influenced by a fault. A reduction of the number of protocol messages can be reached by weakening the fault-detection capability. This idea is the basis for a new class of protocols, the communication step overlapping fault detection protocols. This class is characterized by the weakness that faults are not detected within the present communication step. First on a later communication step the fault is detected. We put up with this weakness, because we need the efficiency. A special realization is the even-odd protocol. The idea is the result from the following thoughts. If every replica communicates only with one other replica, so that replica 0 of the sender only sends a message to replica 0 of the receiver, replicas 1 in the same manner, the protocol is free of time overhead. Unfortunately this does not provide fault detection because the system is splitted in two independent systems which interact only when extra comparisons are done. To maintain fault detection by communication, we switch between direct and crossing protocol messages alternatively. The even-odd-protocol uses a counter for every sender-receiver pair s,r (comparable to sequence numbers). With every communication of this pair the counter c(s,r) is incremented. If c(s,r) is even the replicas communicates only in direct manner, otherwise they communicate with crossing messages from replica 0 to replica 1 and vice versa. Message i+1 contains the hash value of message i as an extension. In this way, a replica fault which influences communication step i is detected on communication step i+1.

We assume that the processes communicate more frequently than checkpoint comparisons are done. Otherwise you may insert communications only for the sake of fault detetction. An illustration of the m(even), m(odd) and m2 steps is given in Figure 1.

A problem arises by the fault propagation with communication sequences. If these sequences form a cycle or connect two processes via different processes, the faulty state can propagate to two replicas of a duplex systems. To prohibit fault propagation, we use m2-steps if a cyclic sequence is closed or paths between two application processen via different process pairs are formed.

The even-odd protocol serves the following properties:

  • Detection of faults during communication steps. A faulty or a missing message is detected at the next communication of the same sender receiver pair.
  • Efficiency. On average it uses less messages than the use of m2-steps only. The most communication patterns require three protocol communication steps per application message on average. The exclusive usage of m2-steps would need four protocol communications on average.
  • Restricted Fault Propagation. The insertion of m2-steps prohibits the propagation of single process faults to two replicas of a process. As a requirement for roll-forward recovery one replica state remains fault free.
  • Applicability in combination with roll-forward recovery. The system must be splitted in to halfs consistently and the source of the fault has to be traced. This procedure is based on the same control information which is used by the even-odd protocol.

Measurements were done on parallel applications. The overhead of process duplication using the even-odd protocol was compared with the overhead using a m2-protocol.

On the most examples the protocol can save a quarter of the protocol messages on average compared to a m2-Protocol. The reduction of the message number leads to a lower protcol overhead and also to a lower load of the communication network. Here we present the results using a direct solver for a system of linear equations as example. Figure 2 shows the protocol overhead in percent measured with different problem sizes. As algorithm the gauss-jordan elimination is used. The problem size appears as the number of unknown variables. This application uses the even-odd protocol less beneficial, because of some m-steps at the beginning. Therefore some m2-steps were forced by the application. As result we obtained a more beneficial selection of protocol steps for the following communications.

The combiantion with roll-forward recovery is based on the following ideas:

  • Backtracing to the duplex system which caused the fault.When a fault is detected, negative acknowlegdements are transfered to the sender. A retransmission of the recently sent messages per m-steps is activated and versions are generated. This is done via the entire path of a possible fault propagation.
  • Decision if faulty locally. With the retransmitted messages the decision is possible, if the fauly state was caused by the duplex partner of the same application process or it was the result of a fault propagation.
  • During the program is progressing in two isolated halfs, the fault is local identified by a retry phase starting from a recently taken checkpoint. When the fault free version is identified, every replica of the fault free half copy its state to its partner of the faulty half.
  • Validation of single replica phases. This may be done later in parallel to the progress of the application where every application process is a duplex system again.

The advantage to backward recovery is, that only the creation of additional checkpoints delays the application progress. This is necessary to copy states from a process to another process or to obtain the basis for a validation run. The combination of the even-odd protocol and roll-forward recovery is an optimistic fault tolerance scheme, because the overhead in the fault free case is low, and the delay in occurence of single faults lower compared to backward recovery. In less probable scenarios (multiple faults during a relatively short period), a rollback is necessary.

References

[LoFA 90] J. Long, W.K. Fuchs and J.A. Abraham,
A Forward Error Recovery Strategie using Checkpointing in Distributed Systems, In: International Conference on Parallel Processing (Proceedings), August,1990 .

[PrVa 94] D.K. Pradhan and N.H. Vaidya,
Roll-Forward Checkpointing Scheme: A Novel Fault-Tolerant Architecture, IEEE Transactions on Computers 43(10), October1994.


1. Author contact: Dresden University of Technology, Institute of Computer Systems, D-01062 Dresden, Germany
Phone: +49 351 4638554. Fax: +49 351 4638245. E-Mail: ps1@irz.inf.tu-dresden.de