Exploiting Control Theory for Proving the Stability of Distributed
Algorithms
Oliver Theel 1 and Felix C.
Gärtner 2
Darmstadt University of Technology
Department of Computer Science
Darmstadt, Germany
A particularly suitable design strategy for constructing
a robust, fault-tolerant distributed application is to design it as a self-stabilizing
algorithm [1,7].
Informally, an algorithm has the self-stabilizing property, if - starting
from an illegal state - it is guaranteed to return to a specified set of
legal states after a finite period of time. Additionally, the set of legal
states must be closed under normal system execution, meaning that the algorithm
does not voluntarily switch to any illegal state. The definition of legal
and illegal states depends on the particular application. Generally, all
legal states are specified (e.g., by a state predicate) and illegal states
are defined to be those states which are not legal states. This statement
seems to be trivial and unimportant but it embodies the entire potential
of self-stabilizing algorithms: if the algorithm is guaranteed to reach
a legal state from any state, then it is able to tolerate any
kind of failure that perturbs the system state (i.e., any kind of transient
failure). The resulting algorithms are capable of tolerating failure scenarios
which were unforeseeable at design time. Unfortunately, proving the self-stabilization
property of an algorithm is a complicated task [4].
This is particularly true for close-to-reality message passing models of
distributed computing, like the so-called distributed execution model
[2], where proofs even for quite simple self-stabilizing
algorithms require enormous efforts. Consequently, the research community
is highly engaged in finding more adequate proof techniques.
The self-stabilization property of a distributed algorithm
as described above exhibits interesting analogies to stable control circuits
used in various engineering domains, like electrical and mechanical engineering.
Informally, a control circuit is stable, if after a certain period of time,
the controlled system reaches and remains in a pre-defined state [3]. Contrary to the self-stabilization research domain,
which is a rather new area of research in computer science, control theory
in the engineering domain has a century-old background and offers a well-understood
theoretical foundation with powerful criteria for reasoning about the stability
of control circuits.
The aim of our research is to narrow the gap between
self-stabilization and control theory by adopting criteria originally used
for deciding on the stability of control circuits for proving the self-stabilization
of distributed algorithms. We demonstrate our approach in three steps:
- We present a specific sample application.
- We model the application as a self-stabilizing algorithm and sketch
its complicated proof.
- We model the application as a control circuit and sketch an elegant
and simple proof.
It will turn out that the adoption of control theory
is a viable means for obtaining stability statements for self-stabilizing
distributed algorithms, which can be modeled as linear, discrete control
circuits, in a very simple manner.
Application Scenario.
We assume a distributed environment where client processes
(also called customers invoke certain services. The invocation
of a service is not free of charge: the customer must pay for it by first
acquiring and subsequently presenting a capability to the service provider.
Such a capability can be thought of as a one-time capability (like a pre-paid
service ticket [6]) or can be used for invoking
a service multiple times within a given, fixed time interval (like Kerberos
session keys). After the time interval has passed, the capability becomes
invalid and the customer must acquire a new one, if needed.
Capabilities (or tickets in the scope of the
paper) are created by a trusted authority, located at some node of the network.
We call this trusted authority ticket provider. The ticket provider
distributes a certain amount of tickets to ticket dealers located
at various nodes of the network. The ticket dealers, on their turn, finally
sell tickets to customers. The idea behind storing tickets not only at the
ticket provider but at the ticket dealers is (1) to balance the load created
by selling tickets to customers throughout the network and (2) to avoid
an availability bottleneck with respect to service tickets, in case the
ticket provider is temporarily unavailable. To be more precise, the ticket
provider tries to guarantee that at any given time a ticket dealer at node
i stores exactly ti tickets. The number of
tickets which are actually stored at dealer i is denoted as f(ci)
with ci being some local data structure of the ticket
dealer. Unfortunately, due to the limited life-time of tickets stored at
the dealers and due to customers purchasing tickets, f(ci)
<> ti holds at some points in time. The problem to
be solved is to find a distributed algorithm that compensates this drift.
Self-Stabilizing Algorithm Approach.
Apart from constructing an algorithm from scratch, a
more efficient way to design and verify a self-stabilizing algorithm for
the distributed execution model is (1) to design and verify the algorithm
in the weaker serial model [2] and
(2) to transform this algorithm into an algorithm based on the distributed
execution model [5]. The former model assumes
communication to take place through communication registers shared by neighboring
processes. The latter model uses message passing techniques.
In the serial model, the application scenario can be
modeled as follows: every ticket dealer i holds its tickets in
a local data structure ci and additionally has access
to the target number of tickets ti stored at the ticket
provider. Legal states are then defined by a predicate P which
is equivalent to fact that f(ci) = ti
for all i. If f(ci) <> ti,
the dealer ``requests new tickets from the provider'' and subsequently updates
its local store ci once the missing tickets have been
received. At any time, i.e., if f(ci) <> ti
holds or not, expired tickets must be deleted from ci.
(The ticket request procedure, invalidation of expired tickets, and the
enforcement of certain ticket production constraints are modeled by finite
state automata at the ticket dealer.)
The ``faults'' which may occur in the system are all
actions that change the value of f(ci) or ti.
In particular, f(ci) and ti, respectively,
are impacted by customers purchasing tickets, by ticket expiration due to
age, or by a direct modification of ti through the ticket
provider. A correctness proof must address two issues [7]:
(1) P is closed, and (2) starting from any state, the system eventually
reaches a state in P. While the former requirement is easy to show,
the latter demands a complex argument involving a monotonically decreasing
bounded function rendering the entire proof quite complicated even for this
simple application.
Control Circuit Approach.
Figure 1 shows the modeling of the application scenario
as a control circuit for the ticket provider and one ticket dealer i.
Since the entire control circuit is implemented based
on the components of a distributed computing environment, it is necessarily
a discrete control circuit. This means that state changes can only
occur and are only observed at discrete points kT0,
k in N0, T0 > 0, in
time. Consequently, ti(kT0) is the target
number of tickets the ticket provider wants to be stored at the ticket dealer
at time kT0. f(ci,kT0)
denotes the number of still valid tickets stored at the ticket dealer at
time kT0. At every discrete point in time kT0,
the ticket provider sends a message containing the newly issued tickets
to the ticket dealer. Such a message may be delayed due to a network latency
d1T0, d1 in N0,
i.e., a message sent at time k1T0 is received
by the ticket dealer at time (k1+d1)T0.
An analogous network latency d2T0, d2
in N0, may occur for messages sent from the ticket dealer
to the ticket provider. In such a message sent at time k2T0,
the ticket provider, after reception, is informed about the number of tickets
available at the ticket dealer at time (k2-d2)T0.
v is a correction term which compensates the amplification factor
of the control circuit. It assures, that (if stable) ti
= f(ci), as given by the application scenario, holds.
ui(kT0) is the difference (1/v)ti(kT0)
- f(ci,kT0). Gc, Gp,
Gd1, and Gd2 specify the behavior of
the ticket provider, the ticket dealer and the communication network, respectively.
Their description is beyond the scope of this abstract.
The stability proof for the control circuit is obtained
as follows. In a first step, the z-Transform of all terms introduced above
is calculated. Secondly, using the z-Transforms, the output/input behavior
of the control circuit f(ci,kT0)/((1/v) ti(kT0))
is derived, namely (as z-Transform)
The third step concludes the proof by simply showing
that all singularities of Formula (1) lie within a circle of radius one
in the z-plane [3], which can be shown easily by
adopting standard mathematical methods.
References
[1] E. W. Dijkstra. Self
stabilizing systems in spite of distributed control. CACM, 17(11):643-644,
1974.
[2] S.-T. Huang, L.-C. Wuu,
and M.-S. Tsai. Distributed execution model for self-stabilizing systems.
In Proc. of the 14th Intern. Conf. on Distr. Comp. Systems, pages
432-439, 1994.
[3] R. Isermann. Digitale
Regelsysteme, Band I (in German). Springer-Verlag, New York, 1988.
ISBN 0-387-16596-7.
[4] J. L. W. Kessels. An
exercise in proving self-stabilization with a variant function. Inform.
Proc. Letters, 29:39-42, 1988.
[5] M. Mizuno and H. Kakugawa.
A timestamp based transformation of self-stabilizing programs for distributed
computing environments. In Proc. of the 10th Intern. Workshop on Distr.
Algo., Springer LNCS:1151, pages 304-321, 1996.
[6] H.-H. Pagnia, C. Liebig, and
O. Theel. Mutual Trust between Customers and Providers of Distributed Services.
In Proc. of the Intern. Conf. on Networks, Orlando, FL, pages 87-90.
1996.
[7] M. Schneider. Self-stabilization.
ACM Computing Surveys, 25(1):45-67, 1993.
|