The Byzantine Generals Problem: Identifying the Traitors 

B. Ayeb and A. Farhat 
University of Sherbrooke
DMI-- RiFa Lab

There is a variety of algorithms proposed in the literature addressing the Byzantine problem. Their common objective is to mask the effect of  faulty units and reach an agreement among fault free units. This ongoing work aims  at identifying faulty units while the process of agreement is running. Beyond theoretical results, such  a identification of faulty units  contributes to: 

  • accelerate the agreement process itself. 
  • reduce drastically the number of messages. 

  • The overhead for identifying faulty units is rather not expensive, it is in O(n^3), where n denotes the number of units. 
  • We assume that units communicate with oral messages within a synchronous and reliable communication system. Finally,  a faulty unit could behave arbitrarily (i.e., the Byzantine fault model), whereas t denotes the number of faulty units and does not exceed n/3. The proposed mechanism for the identification of faulty units uses the following prosecution principle depicted on Figure 1. 

       

    unit q -------value v-------> unit p
    \ /
    \ /
    \ value v'
    \ /
    \ /
    \-----unit r------/

                          Figure 1 

    Suppose that  unit p receives directly a value, say v, from unit q and receives a second value, say  v', via unit r. If v and v' differ then unit p suspects the couple of units q and r. 

    Using the collected  couples, unit p could identify the faulty units in the following way. Whenever, a unit, say q, occurs more than t times then unit p declares q as faulty . In case where unit q occurs k times then a simple procedure  in O(n^3) ensures to unit p if it could declare unit q as faulty.  Finally, whenever unit p declares that unit q is faulty, it stops to communicate with q. It has been shown that this does not affect the process of agreement. 

    Now suppose that unit p owns its diagnosis (i.e., a subset of faulty units) and another unit, say r, has  also its own diagnosis. We propose a safe protocol for diagnoses exchange, where units p and r could make profit each from other. Note that the protocol takes into account the fact that  a faulty unit tries to send a non correct diagnosis. 

    As one could expect, without additional constraints it is impossible to ensure that all faulty units are identified. That why we state the necessary and sufficient conditions for the identification of all and only faulty units. Succinctly, the conditions specifies that all faulty units should prosecuted approximately from t+1 to  n^2 times. Considering the fact that the number of exchanged messages is in O(n^t), the requirement is rather reasonable. 

    By designing a grouped  architecture of units, we also succeed in designing an efficient algorithm for the identification of all faulty units without additional constraints. In fact, the identification of faulty units  makes  profit from a well suited broadcast primitive as well as the distributed architecture of units into a collection of communicating groups. 

    On going work includes theoretical analysis,  analytical comparison with related work [Sh 87, Co 93, Ch 96], and experimentation. Current work aims at measuring  the actual profits, namely in terms of messages exchange, and compares the running time of agreement with (rep. without) faults identification. First experimentation's shows an appreciable gain but further experimentation should shed more lights.   

    References

    [Sh 87] K. G. Shin & P. Ranmathan "Diagnosis of Processors with Byzantine Faults in a Distributed Computing System," 55-60, Proceedings of the FTCS 17, IEEE Computer Society Press, 1987. 
    [Co 93] B. A. Coan "Efficient Agreement Using Fault Diagnosis,"  Distributed Computing, (7): 87-98, Springler Verlag, 1993 
    [Ch 96] T. D. Chandra & S. Toueg  "Unreliable Failure Detectors for Reliable Distributed Systems,"  JACM, 43(2): 225-267, ACM Press, 1996. 
    [Ay 98] B. Ayeb & A. Farhat  "The Byzantine Problem: Masking and Demasking Faults ,"  Under Press, 1998. 


     Authors contact: University of Sherbrooke, Faculty of Sciences / DMI., RiFa Lab, Sherbrooke, PQ, Canada J1K 2R1. Phone: 819-821-8000, x: 2855. Fax: 819-821-8200. E-Mail: ayeb@dmi.usherb.ca.