Resynchronization Methods for Software Supervision: Experience

Rudolph E. Seviora
Bell Canada Software Reliability Laboratory
University of Waterloo, Waterloo, Ont Canada, N2L 3G1
seviora@swen.uwaterloo.ca, V: +1-519-885-1211 ext 2850, F: +1-519-746-3077
 

Progress of scientific and engineering disciplines depends critically on advances in measuring instrumentation. At present, software reliability engineering relies primarily on manually collected failure data. Recently, several research efforts have addressed automatic, specification-based detection of software failures. In one such effort, a separate unit called the supervisor observes the inputs and outputs of the target software program [1]. Internally, the supervisor interprets the specification of external behavior of the program. Discrepancies between the expected and the observed behavior are reported as failures.

Several issues arise in supervision. These include the availability of formal specifications of the program's external behavior and their representation, observability of the target I/O, the impact of specification nondeterminism, the tradeoff between accuracy and computational cost, and continuation of detection after a failure has occurred. The last issue arises because the supervisor must track the state of the target as it executes. When a failure occurs, the supervisor can no longer be sure about the target's post-failure state. Yet, the supervision must continue.

There are two principal alternatives for continuation. In one, the supervisor continues with the state the target software system should have, and reports all deviations as failures. In the other, the supervisor adjusts its internal state to that of the target (resynchronizes with the target), and reports only further deviations from this base. The latter is often preferred for field operation and testing, since the failure logs are not cluttered with redundant information.

This paper considers resynchronization for a class of software systems commonly found in telecommunications: event-driven, with multiple service access points (SAPs) and loose direct and indirect coupling between SAPs. The specifications of external behavior of such systems are typically expressed in a formalism based on communicating finite state machines. The ITU-T Specification and Description Language (SDL) is an example of such a formalism.

Resynchronization is a very difficult problem. Conceptually, the supervisor can determine the post-failure state of the target by initially assuming that it could be in any possible state. During post-failure supervision, the individual state assumptions whose behavior does not match the observed behavior would be terminated. This is obviously very expensive. The number of candidates could be reduced by considering only those SAPs that were active around the time of the failure. However, for practical systems, the computational cost is still prohibitive.
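
The exhaustive approach described above can be illustrated with a minimal sketch. It assumes a single deterministic finite state machine rather than the communicating machines of a full SDL specification. The transition table SPEC, its state and signal names, and the function prune_candidates are all hypothetical and serve only to show the elimination step.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy specification (not from the paper):
# (state, input) -> (next_state, expected_output)
SPEC: Dict[Tuple[str, str], Tuple[str, str]] = {
    ("idle", "offhook"): ("dialtone", "tone_on"),
    ("idle", "onhook"): ("idle", "none"),
    ("dialtone", "digit"): ("dialing", "tone_off"),
    ("dialtone", "onhook"): ("idle", "tone_off"),
    ("dialing", "digit"): ("dialing", "none"),
    ("dialing", "onhook"): ("idle", "none"),
}


def prune_candidates(
    spec: Dict[Tuple[str, str], Tuple[str, str]],
    candidates: Iterable[str],
    observed: List[Tuple[str, str]],
) -> Set[str]:
    """Advance every candidate state over the observed (input, output)
    pairs and terminate candidates whose predicted output mismatches."""
    alive = set(candidates)
    for inp, out in observed:
        survivors = set()
        for state in alive:
            step = spec.get((state, inp))
            # Keep the candidate only if the specification allows this
            # input in this state and predicts the observed output.
            if step is not None and step[1] == out:
                survivors.add(step[0])
        alive = survivors
    return alive
```

Starting from all three toy states and observing the pair ("digit", "none") leaves only the candidate that was in "dialing". In a system with many SAPs the initial candidate set is the product of the per-process state sets, which is what makes this approach so costly.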

The paper outlines two lower-cost methods for resynchronization and summarizes experience obtained in their evaluation. The first method reduces the cost of resynchronization by being more selective when generating candidate post-failure states. The second initially considers only groups of states (candidate superstates), with a subsequent resolution to a single state as the target executes.

Generating Operator Method

This method employs a set of operators to generate candidate execution trajectories from a known pre-failure state and observed pre-failure behavior [2]. The candidate generation operators are selected to represent effects of commonly occurring faults in the target. The set of trajectories (post-failure states) so generated is further pruned using a metric based on the distance between the candidate's behavior and the actually observed one.

An experimental evaluation of this method was carried out. The target was the call processing control program of a small telephone exchange. The program was specified in SDL. The exchange hardware and its telephones were simulated and the control program executed on a high-performance workstation. Four generation operators were used - signal insertion, deletion, modification and redirection. Faults typically found in these kinds of programs were then seeded into the control program.
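
One possible reading of the four generation operators is as single edits applied to a sequence of interprocess signals. The sketch below follows that reading. The encoding of a signal as a (destination, name) pair, the position-wise distance metric, and every function name are assumptions made for illustration; the abstract does not define them, and [2] should be consulted for the actual scheme.

```python
from typing import Callable, List, Sequence, Tuple

Signal = Tuple[str, str]  # assumed encoding: (destination process, signal name)
Trace = List[Signal]


def insertions(trace: Trace, alphabet: Sequence[Signal]) -> List[Trace]:
    """Signal insertion: add one extra signal at any position."""
    return [trace[:i] + [s] + trace[i:]
            for i in range(len(trace) + 1) for s in alphabet]


def deletions(trace: Trace) -> List[Trace]:
    """Signal deletion: drop one signal."""
    return [trace[:i] + trace[i + 1:] for i in range(len(trace))]


def modifications(trace: Trace, names: Sequence[str]) -> List[Trace]:
    """Signal modification: replace one signal name, same destination."""
    return [trace[:i] + [(trace[i][0], n)] + trace[i + 1:]
            for i in range(len(trace)) for n in names if n != trace[i][1]]


def redirections(trace: Trace, processes: Sequence[str]) -> List[Trace]:
    """Signal redirection: deliver one signal to a different process."""
    return [trace[:i] + [(p, trace[i][1])] + trace[i + 1:]
            for i in range(len(trace)) for p in processes if p != trace[i][0]]


def generate_candidates(trace: Trace, processes: Sequence[str],
                        names: Sequence[str]) -> List[Trace]:
    """Apply each operator once to the pre-failure signal trace."""
    alphabet = [(p, n) for p in processes for n in names]
    return (insertions(trace, alphabet) + deletions(trace)
            + modifications(trace, names) + redirections(trace, processes))


def distance(a: Sequence, b: Sequence) -> int:
    """Assumed metric: position-wise mismatches plus length difference."""
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches + abs(len(a) - len(b))


def prune(candidates: List[Trace], observed: Sequence,
          behaviour: Callable[[Trace], Sequence], max_distance: int) -> List[Trace]:
    """Keep candidates whose predicted behaviour is close to the observed one."""
    return [c for c in candidates
            if distance(behaviour(c), observed) <= max_distance]
```

Here behaviour stands in for replaying a candidate through the specification to obtain its predicted external behavior. Because each operator is applied only once, the candidate set grows linearly with trace length rather than with the size of the state space.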

The results obtained are summarized in the table below. The table shows the number of spurious failure reports and the average number of candidate post-failure states generated with one seeded fault. Seeded faults are categorized by whether they corrupted interprocess communication (IPC), long-duration timers (such as the timer for the first digit), or variable values.

Class of Fault    Spurious Failures    Avg Candidate States
A (IPC)           0                    2.36
B (timer)         0-1                  3.0
C (variable)      >2                   2.71

State Hierarchy Method

The second method employs an abstract model of the external behavior of the target, in addition to its specification model [3]. In the abstract model, states of specification processes are grouped into superstates that exhibit some external commonality (for example, the telephone call states could be grouped into call setup, conversation, call takedown and idle superstates). When a failure is detected at a SAP, the supervision of that SAP continues at the abstract level only. All or a subset of the superstates become the candidate post-failure superstates. During post-failure supervision, superstates whose external behavior does not match the observed one are terminated. If an external input or output whose presence uniquely identifies the SAP state is observed at the SAP, the supervision of the SAP resumes with the full specification model. While supervision proceeds at the abstract level, the failure detection capability at the access point is reduced.
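
The superstate mechanism can be sketched for a single SAP as follows. The four superstates come from the example in the text; the abstract transition table ABSTRACT, the event names, the table IDENTIFYING of state-identifying events, and the function resynchronize are hypothetical and are not taken from [3].

```python
from typing import Dict, Iterable, Optional, Set, Tuple

# Hypothetical abstract model: (superstate, external event) -> next superstate
ABSTRACT: Dict[Tuple[str, str], str] = {
    ("idle", "offhook"): "setup",
    ("setup", "digit"): "setup",
    ("setup", "answer"): "conversation",
    ("setup", "onhook"): "idle",
    ("conversation", "onhook"): "takedown",
    ("takedown", "release_done"): "idle",
}

# Hypothetical events whose presence pins down one concrete
# specification state (the state entered after the event).
IDENTIFYING: Dict[str, str] = {
    "answer": "talking",
    "release_done": "idle",
}


def resynchronize(
    candidates: Iterable[str], events: Iterable[str]
) -> Tuple[Optional[str], Set[str]]:
    """Supervise at the abstract level after a failure.

    Returns (concrete_state, surviving_superstates). concrete_state is
    None while the SAP state is still ambiguous, or when no candidate
    superstate explains the observed events."""
    alive = set(candidates)
    for ev in events:
        # Terminate superstates that cannot exhibit this event.
        alive = {ABSTRACT[(s, ev)] for s in alive if (s, ev) in ABSTRACT}
        if not alive:
            return None, alive          # unable to resynchronize
        if ev in IDENTIFYING:
            return IDENTIFYING[ev], alive  # resume full-model supervision
    return None, alive
```

With all four superstates as candidates, the events "digit" then "answer" narrow the set to the conversation superstate and identify the concrete state "talking". A lone "onhook" leaves two superstates alive, so supervision would stay at the abstract level.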

This method was experimentally evaluated on the testbed described above. The target was again the call processing control program of the small exchange. The abstract model contained only two superstates, idle and active, for the processes that specify the external behavior at telephones. The active superstate became the candidate post-failure superstate for a phone that experienced a failure. To speed up evaluation, the effects of faults in the target were modeled by directly inserting failures into traces of external behavior collected on a well-debugged version of the target. The supervisor only processed the seeded traces.

The results obtained are summarized in the table below. The table shows the percentage of experimental runs in which the supervisor was unable to resynchronize with the target and the number of spurious failure reports generated after the seeded failure, for one and two seeded failures (SF), and several levels of randomly generated telephone traffic (cppph = calls per phone per hour).
 

                  Resync Failures [%]      Spurious Failure Reports
Load [cppph]      SF=1        SF=2         SF=1        SF=2
4                 0 %         2 %          0.65        0.62
8                 12 %        16 %         0.74        0.69
12                6 %         26 %         0.71        1.0

Summary

Overall, initial evaluations suggest that the first method reported fewer spurious failures, generated more candidate post-failure states, and had higher computational requirements. The second method was simpler, needed less computational power, but generated somewhat more spurious failure reports.

References

  1. T.Savor and R.E.Seviora: "Towards Hierarchical Software Supervision", IEEE Computer, Vol 31, No 8, pp. 68-74, August 1998.
  2. R.Kovacevic: "A Resynchronization Scheme for Belief-Based Real-Time Software Supervision", 156 pp., Bell Canada Software Reliability Laboratory, University of Waterloo, 1996.
  3. C.Moffett: "State Abstracted Resynchronization", 114 pp., Bell Canada Software Reliability Laboratory, University of Waterloo, Technical Report, 1998.