Resynchronization Methods for Software Supervision: Experience
- Rudolph E. Seviora
- Bell Canada Software Reliability Laboratory
- University of Waterloo, Waterloo, Ont., Canada, N2L 3G1
- seviora@swen.uwaterloo.ca, V: +1-519-885-1211 ext. 2850, F: +1-519-746-3077
Progress of scientific and engineering disciplines depends
critically on advances in measuring instrumentation. At present,
software reliability engineering relies primarily on manually
collected failure data. Recently, several research efforts have
addressed automatic, specification-based detection of software
failures. In one such effort, a separate unit called the supervisor
observes the inputs and outputs of the target software program
[1]. Internally, the supervisor interprets the specification
of external behavior of the program. Discrepancies between the
expected and the observed behavior are reported as failures.
Several issues arise in supervision. These include the availability
of formal specifications of the external behavior of the program and
their representation, observability of the target I/O, impact
of specification nondeterminism, the tradeoff between accuracy and
computational cost, and continuation of detection after an occurrence
of failure. The last issue arises because the supervisor must track
the state of the target as it executes. When a failure occurs,
the supervisor can no longer be sure about the target's post-failure
state. Yet supervision must continue.
There are two principal alternatives for continuation. In
one, the supervisor continues with the state the target software
system should have, and reports all deviations as failures. In
the other, the supervisor adjusts its internal state to that
of the target (resynchronizes with the target), and reports only
further deviations from this base. The latter is often preferred
for field operation and testing, since the failure logs are not
cluttered with redundant information.
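The difference between the two alternatives can be illustrated with
a minimal Python sketch. The toy single-phone specification `SPEC`,
the signal names, the trace, and the `resync_state` oracle are all
assumptions made for illustration and are not taken from the supervisor
in [1]. In particular, the oracle simply hands the supervisor the
target's post-failure state, which a real supervisor would have to infer.

```python
# Hypothetical toy specification of one phone:
# (state, input signal) -> (expected output, next state)
SPEC = {
    ("idle", "offhook"): ("dialtone", "dialing"),
    ("dialing", "digit"): ("ringback", "ringing"),
    ("ringing", "answer"): ("connect", "talking"),
    ("talking", "onhook"): ("release", "idle"),
    ("dialing", "onhook"): ("release", "idle"),
    ("ringing", "onhook"): ("release", "idle"),
}


def step(state, signal):
    """Assumption: unspecified inputs are ignored (no output, same state)."""
    return SPEC.get((state, signal), ("none", state))


def supervise(trace, policy, resync_state=None, start="idle"):
    """Return the indices of trace events reported as failures.

    policy == "expected": continue with the state the target should have.
    policy == "resync":   after a failure, adopt resync_state(index) as the
                          new base (an oracle in this sketch).
    """
    state, failures = start, []
    for i, (signal, observed) in enumerate(trace):
        expected, nxt = step(state, signal)
        if observed == expected:
            state = nxt
        else:
            failures.append(i)
            state = nxt if policy == "expected" else resync_state(i)
    return failures


# Illustrative trace: the faulty target connects the call on the digit,
# skipping the ringing state, and then ignores the later "answer".
TRACE = [
    ("offhook", "dialtone"),
    ("digit", "connect"),
    ("answer", "none"),
    ("onhook", "release"),
]

print(supervise(TRACE, "expected"))
print(supervise(TRACE, "resync", resync_state=lambda i: "talking"))
```

On this trace the first policy logs two failures for a single underlying
fault, while the second logs only the initial one.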
This paper considers resynchronization for a class of software
systems commonly found in telecommunications - event-driven,
with multiple service access points (SAPs) and loose direct and
indirect coupling between SAPs. The specifications of external
behavior of such systems are typically expressed in a formalism
based on communicating finite state machines. The ITU-T Specification
and Description Language (SDL) is an example of such a formalism.
Resynchronization is a very difficult problem. Conceptually,
the supervisor can determine the post-failure state of the target
by initially assuming that it could be in any possible state.
During post-failure supervision, the individual state assumptions
whose behavior does not match the observed behavior would be
terminated. This is obviously very expensive. The number of candidates
could be reduced by considering only those SAPs that were active
around the time of the failure. However, for practical systems,
the computational cost is still prohibitive.
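The conceptual scheme can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the toy specification
`SPEC`, the state and signal names, and the helper `joint_candidates`
are hypothetical, and the sketch handles a single SAP with a
deterministic specification.

```python
# Hypothetical toy specification of one phone:
# (state, input signal) -> (expected output, next state)
SPEC = {
    ("idle", "offhook"): ("dialtone", "dialing"),
    ("dialing", "digit"): ("ringback", "ringing"),
    ("ringing", "answer"): ("connect", "talking"),
    ("talking", "onhook"): ("release", "idle"),
    ("dialing", "onhook"): ("release", "idle"),
    ("ringing", "onhook"): ("release", "idle"),
}
STATES = {"idle", "dialing", "ringing", "talking"}


def step(state, signal):
    """Assumption: unspecified inputs are ignored (no output, same state)."""
    return SPEC.get((state, signal), ("none", state))


def eliminate(candidates, trace):
    """Advance every candidate state through the observed post-failure
    trace and terminate those whose expected output does not match."""
    alive = set(candidates)
    for signal, observed in trace:
        alive = {
            nxt
            for state in alive
            for expected, nxt in [step(state, signal)]
            if expected == observed
        }
    return alive


def joint_candidates(n_saps):
    """Number of initial candidates if n_saps independent SAPs must each
    be assumed to be in any state (a simple product, for illustration)."""
    return len(STATES) ** n_saps


POST_FAILURE = [("answer", "none"), ("onhook", "release")]
print(eliminate(STATES, POST_FAILURE[:1]))  # several candidates survive
print(eliminate(STATES, POST_FAILURE))      # a single candidate remains
print(joint_candidates(3))
```

Even in this toy setting the number of joint candidates grows
exponentially with the number of SAPs considered, which is the cost
the text refers to.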
The paper outlines two lower-cost methods for resynchronization
and summarizes experience obtained in their evaluation. The first
method reduces the cost of resynchronization by being more selective
when generating candidate post-failure states. The second initially
considers only groups of states (candidate superstates), with
a subsequent resolution to a single state as the target executes.
Generating Operator Method
This method employs a set of operators to generate candidate
execution trajectories from a known pre-failure state and observed
pre-failure behavior [2]. The candidate generation operators
are selected to represent effects of commonly occurring faults
in the target. The set of trajectories (post-failure states)
so generated is further pruned using a metric based on the distance
between the candidate's behavior and the actually observed one.
An experimental evaluation of this method was carried out.
The target was the call processing control program of a small
telephone exchange. The program was specified in SDL. The exchange
hardware and its telephones were simulated and the control program
executed on a high-performance workstation. Four generation operators
were used - signal insertion, deletion, modification and redirection.
Faults typically found in these kinds of programs were then seeded
into the control program.
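The following Python sketch illustrates the general idea of generating
candidates with the four operators and pruning them by distance. It
is not the scheme of [2]: the two-phone toy specification, the signal
names, the restriction to one operator application per candidate, and
the simple positional `distance` metric are all assumptions made for
illustration.

```python
# Hypothetical toy specification shared by phones "A" and "B":
# (state, input signal) -> (output, next state)
SPEC = {
    ("idle", "offhook"): ("dialtone", "dialing"),
    ("dialing", "digit"): ("ringback", "ringing"),
    ("ringing", "answer"): ("connect", "talking"),
    ("talking", "onhook"): ("release", "idle"),
    ("dialing", "onhook"): ("release", "idle"),
    ("ringing", "onhook"): ("release", "idle"),
}
NAMES = ["offhook", "digit", "answer", "onhook"]
DESTS = ["A", "B"]
START = {"A": "idle", "B": "idle"}  # known pre-failure state


def run(signals):
    """Execute a signal sequence from START; return (final state, outputs)."""
    state, outputs = dict(START), []
    for dest, name in signals:
        out, nxt = SPEC.get((state[dest], name), (None, state[dest]))
        state[dest] = nxt
        if out is not None:
            outputs.append((dest, out))
    return tuple(sorted(state.items())), outputs


def candidates(signals):
    """Apply each generation operator once at every position."""
    for i, (dest, name) in enumerate(signals):
        yield signals[:i] + signals[i + 1:]                      # deletion
        for other in NAMES:
            if other != name:                                    # modification
                yield signals[:i] + [(dest, other)] + signals[i + 1:]
        for d in DESTS:
            if d != dest:                                        # redirection
                yield signals[:i] + [(d, name)] + signals[i + 1:]
    for i in range(len(signals) + 1):
        for d in DESTS:
            for n in NAMES:                                      # insertion
                yield signals[:i] + [(d, n)] + signals[i:]


def distance(predicted, observed):
    """Assumed metric: positional mismatches plus length difference."""
    mismatches = sum(p != o for p, o in zip(predicted, observed))
    return mismatches + abs(len(predicted) - len(observed))


def resync_candidates(signals, observed):
    """Keep the post-failure states whose behavior is closest to observed."""
    scored = [(distance(outs, observed), state)
              for state, outs in map(run, candidates(signals))]
    best = min(d for d, _ in scored)
    return {state for d, state in scored if d == best}


# The digit sent to phone A was apparently lost: no ringback was observed.
INTENDED = [("A", "offhook"), ("A", "digit")]
OBSERVED = [("A", "dialtone")]
print(resync_candidates(INTENDED, OBSERVED))
```

In this example several operator applications (deleting, modifying,
or redirecting the digit signal) explain the observed behavior equally
well, but they all lead to the same candidate post-failure state.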
The results obtained are summarized in the table below. The
table shows the number of spurious failure reports and the average
number of candidate post-failure states generated with one seeded
fault. Seeded faults are categorized based on whether they resulted
in corruption of interprocess communication (IPC), long duration
timers (such as timing for the first digit), or variable values.
| Class of Fault | Spurious Failures | Avg. Candidate States |
|----------------|-------------------|-----------------------|
| A (IPC)        | 0                 | 2.36                  |
| B (timer)      | 0-1               | 3.0                   |
| C (variable)   | >2                | 2.71                  |
State Hierarchy Method
The second method employs an abstract model of external behavior
of the target, in addition to its specification model [3]. In
the abstract model, states of specification processes are grouped
into superstates that exhibit some external commonality (for
example, the telephone call states could be grouped into call
setup, conversation, call takedown and idle superstates). When
a failure is detected at a SAP, the supervision of the SAP continues
at the abstract level only. All or a subset of superstates become
the candidate post-failure superstates. During post-failure supervision,
superstates whose external behavior does not match the observed
one are terminated. If an external output/input is observed at
a SAP whose presence uniquely identifies the SAP state, the supervision
of the SAP resumes with the full specification model. While supervising
at the abstract level, the failure detection capability at the
access point is reduced.
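The steps above can be sketched in Python as follows. This is a
minimal illustration rather than the implementation of [3]: the toy
specification, the three-superstate grouping `SUPER`, and the rule
used to derive the identifying outputs are assumptions.

```python
# Hypothetical toy specification of one phone:
# (state, input signal) -> (expected output, next state)
SPEC = {
    ("idle", "offhook"): ("dialtone", "dialing"),
    ("dialing", "digit"): ("ringback", "ringing"),
    ("ringing", "answer"): ("connect", "talking"),
    ("talking", "onhook"): ("release", "idle"),
    ("dialing", "onhook"): ("release", "idle"),
    ("ringing", "onhook"): ("release", "idle"),
}

# Assumed abstract model: superstate -> member specification states.
SUPER = {
    "idle": {"idle"},
    "setup": {"dialing", "ringing"},
    "conversation": {"talking"},
}
SUPER_OF = {s: name for name, members in SUPER.items() for s in members}

# Outputs whose presence uniquely identifies the resulting state:
# every transition that emits them ends in the same state.
_targets = {}
for _out, _nxt in SPEC.values():
    _targets.setdefault(_out, set()).add(_nxt)
IDENTIFYING = {out: next(iter(t)) for out, t in _targets.items() if len(t) == 1}


def step(state, signal):
    """Assumption: unspecified inputs are ignored (no output, same state)."""
    return SPEC.get((state, signal), ("none", state))


def abstract_supervise(candidates, trace):
    """Supervise one SAP at the abstract level after a failure.

    Returns ("full", state) once an identifying output allows supervision
    to resume with the full model, ("abstract", superstates) if the trace
    ends first, or ("resync failed", None) if no candidate survives.
    """
    alive = set(candidates)
    for signal, observed in trace:
        # A superstate survives if any of its member states could have
        # produced the observed output for this input.
        alive = {
            SUPER_OF[nxt]
            for sup in alive
            for state in SUPER[sup]
            for expected, nxt in [step(state, signal)]
            if expected == observed
        }
        if not alive:
            return ("resync failed", None)
        if observed in IDENTIFYING:
            return ("full", IDENTIFYING[observed])
    return ("abstract", alive)


print(abstract_supervise(SUPER, [("answer", "none")]))
print(abstract_supervise(SUPER, [("answer", "none"), ("onhook", "release")]))
```

Because a superstate is kept whenever any member state matches, the
abstract model accepts more behavior than the full specification, which
is why detection capability is reduced while it is in use.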
This method was experimentally evaluated on the testbed described
above. The target was again the call processing control program
of the small exchange. The abstract model contained only two
superstates, idle and active, for the processes that specify the
external behavior at telephones. The active superstate became the
candidate post-failure superstate for each phone that experienced a failure.
To speed up evaluation, the effects of faults in the target were
modeled by directly inserting failures into traces of external
behavior collected on a well-debugged version of the target.
The supervisor only processed the seeded traces.
The results obtained are summarized in the table below. The
table shows the percentage of experimental runs in which the
supervisor was unable to resynchronize with the target and the
number of spurious failure reports generated after the seeded
failure, for one and two seeded failures (SF), and several levels
of randomly generated telephone traffic (cppph = calls per phone
per hour).
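| Load [cppph] | Resync Failures [%], SF=1 | Resync Failures [%], SF=2 | Spurious Failure Reports, SF=1 | Spurious Failure Reports, SF=2 |
|--------------|---------------------------|---------------------------|--------------------------------|--------------------------------|
| 4            | 0                         | 2                         | 0.65                           | 0.62                           |
| 8            | 12                        | 16                        | 0.74                           | 0.69                           |
| 12           | 6                         | 26                        | 0.71                           | 1.0                            |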
Summary
Overall, initial evaluations suggest that the first method
reported fewer spurious failures, generated more candidate post-failure
states, and had higher computational requirements. The second
method was simpler, needed less computational power, but generated
somewhat more spurious failure reports.
References
[1] T. Savor and R.E. Seviora: "Towards Hierarchical Software
Supervision", IEEE Computer, Vol 31, No 8, pp. 68-74, August 1998.
[2] R. Kovacevic: "A Resynchronization Scheme for Belief-Based
Real-Time Software Supervision", 156 pp., Bell Canada Software
Reliability Laboratory, University of Waterloo, 1996.
[3] C. Moffett: "State Abstracted Resynchronization", 114 pp.,
Bell Canada Software Reliability Laboratory, University of Waterloo,
Technical Report, 1998.