Exploiting Control Theory for Proving the Stability of Distributed Algorithms

Oliver Theel 1 and Felix C. Gärtner 2
Darmstadt University of Technology
Department of Computer Science
Darmstadt, Germany

A particularly suitable design strategy for constructing a robust, fault-tolerant distributed application is to design it as a self-stabilizing algorithm [1,7]. Informally, an algorithm has the self-stabilizing property, if - starting from an illegal state - it is guaranteed to return to a specified set of legal states after a finite period of time. Additionally, the set of legal states must be closed under normal system execution, meaning that the algorithm does not voluntarily switch to any illegal state. The definition of legal and illegal states depends on the particular application. Generally, all legal states are specified (e.g., by a state predicate) and illegal states are defined to be those states which are not legal states. This statement seems to be trivial and unimportant but it embodies the entire potential of self-stabilizing algorithms: if the algorithm is guaranteed to reach a legal state from any state, then it is able to tolerate any kind of failure that perturbs the system state (i.e., any kind of transient failure). The resulting algorithms are capable of tolerating failure scenarios which were unforeseeable at design time. Unfortunately, proving the self-stabilization property of an algorithm is a complicated task [4]. This is particularly true for close-to-reality message passing models of distributed computing, like the so-called distributed execution model [2], where proofs even for quite simple self-stabilizing algorithms require enormous efforts. Consequently, the research community is highly engaged in finding more adequate proof techniques.

The self-stabilization property of a distributed algorithm as described above exhibits interesting analogies to stable control circuits used in various engineering domains, like electrical and mechanical engineering. Informally, a control circuit is stable, if after a certain period of time, the controlled system reaches and remains in a pre-defined state [3]. Contrary to the self-stabilization research domain, which is a rather new area of research in computer science, control theory in the engineering domain has a century-old background and offers a well-understood theoretical foundation with powerful criteria for reasoning about the stability of control circuits.

The aim of our research is to narrow the gap between self-stabilization and control theory by adopting criteria originally used for deciding on the stability of control circuits for proving the self-stabilization of distributed algorithms. We demonstrate our approach in three steps:

  • We present a specific sample application.
  • We model the application as a self-stabilizing algorithm and sketch its complicated proof.
  • We model the application as a control circuit and sketch an elegant and simple proof.

It will turn out that the adoption of control theory is a viable means for obtaining stability statements for self-stabilizing distributed algorithms, which can be modeled as linear, discrete control circuits, in a very simple manner.

Application Scenario.

We assume a distributed environment where client processes (also called customers invoke certain services. The invocation of a service is not free of charge: the customer must pay for it by first acquiring and subsequently presenting a capability to the service provider. Such a capability can be thought of as a one-time capability (like a pre-paid service ticket [6]) or can be used for invoking a service multiple times within a given, fixed time interval (like Kerberos session keys). After the time interval has passed, the capability becomes invalid and the customer must acquire a new one, if needed.

Capabilities (or tickets in the scope of the paper) are created by a trusted authority, located at some node of the network. We call this trusted authority ticket provider. The ticket provider distributes a certain amount of tickets to ticket dealers located at various nodes of the network. The ticket dealers, on their turn, finally sell tickets to customers. The idea behind storing tickets not only at the ticket provider but at the ticket dealers is (1) to balance the load created by selling tickets to customers throughout the network and (2) to avoid an availability bottleneck with respect to service tickets, in case the ticket provider is temporarily unavailable. To be more precise, the ticket provider tries to guarantee that at any given time a ticket dealer at node i stores exactly ti tickets. The number of tickets which are actually stored at dealer i is denoted as f(ci) with ci being some local data structure of the ticket dealer. Unfortunately, due to the limited life-time of tickets stored at the dealers and due to customers purchasing tickets, f(ci) <> ti holds at some points in time. The problem to be solved is to find a distributed algorithm that compensates this drift.

Self-Stabilizing Algorithm Approach.

Apart from constructing an algorithm from scratch, a more efficient way to design and verify a self-stabilizing algorithm for the distributed execution model is (1) to design and verify the algorithm in the weaker serial model [2] and (2) to transform this algorithm into an algorithm based on the distributed execution model [5]. The former model assumes communication to take place through communication registers shared by neighboring processes. The latter model uses message passing techniques.

In the serial model, the application scenario can be modeled as follows: every ticket dealer i holds its tickets in a local data structure ci and additionally has access to the target number of tickets ti stored at the ticket provider. Legal states are then defined by a predicate P which is equivalent to fact that f(ci) = ti for all i. If f(ci) <> ti, the dealer ``requests new tickets from the provider'' and subsequently updates its local store ci once the missing tickets have been received. At any time, i.e., if f(ci) <> ti holds or not, expired tickets must be deleted from ci. (The ticket request procedure, invalidation of expired tickets, and the enforcement of certain ticket production constraints are modeled by finite state automata at the ticket dealer.)

The ``faults'' which may occur in the system are all actions that change the value of f(ci) or ti. In particular, f(ci) and ti, respectively, are impacted by customers purchasing tickets, by ticket expiration due to age, or by a direct modification of ti through the ticket provider. A correctness proof must address two issues [7]: (1) P is closed, and (2) starting from any state, the system eventually reaches a state in P. While the former requirement is easy to show, the latter demands a complex argument involving a monotonically decreasing bounded function rendering the entire proof quite complicated even for this simple application.

Control Circuit Approach.

Figure 1 shows the modeling of the application scenario as a control circuit for the ticket provider and one ticket dealer i.

Control Circuit

Since the entire control circuit is implemented based on the components of a distributed computing environment, it is necessarily a discrete control circuit. This means that state changes can only occur and are only observed at discrete points kT0, k in N0, T0 > 0, in time. Consequently, ti(kT0) is the target number of tickets the ticket provider wants to be stored at the ticket dealer at time kT0. f(ci,kT0) denotes the number of still valid tickets stored at the ticket dealer at time kT0. At every discrete point in time kT0, the ticket provider sends a message containing the newly issued tickets to the ticket dealer. Such a message may be delayed due to a network latency d1T0, d1 in N0, i.e., a message sent at time k1T0 is received by the ticket dealer at time (k1+d1)T0. An analogous network latency d2T0, d2 in N0, may occur for messages sent from the ticket dealer to the ticket provider. In such a message sent at time k2T0, the ticket provider, after reception, is informed about the number of tickets available at the ticket dealer at time (k2-d2)T0. v is a correction term which compensates the amplification factor of the control circuit. It assures, that (if stable) ti = f(ci), as given by the application scenario, holds. ui(kT0) is the difference (1/v)ti(kT0) - f(ci,kT0). Gc, Gp, Gd1, and Gd2 specify the behavior of the ticket provider, the ticket dealer and the communication network, respectively. Their description is beyond the scope of this abstract.

The stability proof for the control circuit is obtained as follows. In a first step, the z-Transform of all terms introduced above is calculated. Secondly, using the z-Transforms, the output/input behavior of the control circuit f(ci,kT0)/((1/v) ti(kT0)) is derived, namely (as z-Transform)

Formula (1)

The third step concludes the proof by simply showing that all singularities of Formula (1) lie within a circle of radius one in the z-plane [3], which can be shown easily by adopting standard mathematical methods.

References

[1] E. W. Dijkstra. Self stabilizing systems in spite of distributed control. CACM, 17(11):643-644, 1974.

[2] S.-T. Huang, L.-C. Wuu, and M.-S. Tsai. Distributed execution model for self-stabilizing systems. In Proc. of the 14th Intern. Conf. on Distr. Comp. Systems, pages 432-439, 1994.

[3] R. Isermann. Digitale Regelsysteme, Band I (in German). Springer-Verlag, New York, 1988. ISBN 0-387-16596-7.

[4] J. L. W. Kessels. An exercise in proving self-stabilization with a variant function. Inform. Proc. Letters, 29:39-42, 1988.

[5] M. Mizuno and H. Kakugawa. A timestamp based transformation of self-stabilizing programs for distributed computing environments. In Proc. of the 10th Intern. Workshop on Distr. Algo., Springer LNCS:1151, pages 304-321, 1996.

[6] H.-H. Pagnia, C. Liebig, and O. Theel. Mutual Trust between Customers and Providers of Distributed Services. In Proc. of the Intern. Conf. on Networks, Orlando, FL, pages 87-90. 1996.

[7] M. Schneider. Self-stabilization. ACM Computing Surveys, 25(1):45-67, 1993.

 


1. Author contact: Oliver Theel, Darmstadt University of Technology, Department of Computer Science, D-64283 Darmstadt, Germany. Phone: [49] (6151) 16-5306. Fax: [49] (6151) 16-5410. E-Mail: theel@informatik.tu-darmstadt.de

2. Author contact: Felix C. Gärtner, Darmstadt University of Technology, Department of Computer Science, D-64283 Darmstadt, Germany. Phone: [49] (6151) 16-3908. Fax: [49] (6151) 16-5410. E-Mail: felix@informatik.tu-darmstadt.de