Fault-based DRB (Distributed Recovery Block) for Switching Systems
Byung Sun Lee
Electronics and Telecommunications Research Institute
Taejon, Korea
E-mail: bslee@etri.re.kr

Yong Rae Kwon
Korea Advanced Institute of Science and Technology
Taejon, Korea
E-mail: yrkwon@salmosa.kaist.ac.kr
Abstract
The Distributed Recovery Block
(DRB) is a well-known fault-tolerant technique for real-time systems,
which treats hardware and software faults uniformly and provides a forward
recovery scheme. We consider improving the DRB for practical use in large
and complex real-time systems. We propose a new fault-tolerant technique
designed specifically for large real-time systems, especially switching
systems. Our technique is based on measurements of the types and effects
of faults, gathered from testing experience in the software development
of switching systems. Our approach, the Fault-based DRB (FDRB), adds a
self-checking and selective recovery mechanism to the DRB. We compared
our approach with the ordinary DRB through simulations with injected faults.
1. Introduction
Design diversity is a software technique that provides design
redundancy in order to tolerate human mistakes made while devising
software [Avi86]. The
Distributed Recovery Block (DRB) was proposed to treat both hardware and
software faults in a uniform manner [Kim88][Kim89].
Its relatively low run-time overhead makes it suitable for incorporation
into real-time systems. It uses multiple distributed computing systems
(DCSs), each executing one and only one recovery block. The smallest unit
of software redundancy used is a version of the full application program
running on a computing station.
However, there are constraints on the practical use of the DRB
in large and complex systems. It does not consider the types and effects
of faults, and it requires the same amount of recovery time regardless of
the kind of fault that occurred in a block. Therefore, two alternatives
can be considered for a large real-time system. One is to use a selective
recovery mechanism according to fault type and effect in a computing station;
the other is to build a hierarchical software structure using the DRB
for a large system.
2. Fault Measurement
We selected a prototype telephone
switching system to characterize the types and effects of faults found
in distributed real-time systems. Typical switching system software
is very large and complex, and includes various functions for telephony
hardware resources. A switching system requires high reliability, and in
particular high availability, to meet its grade of service. The measurements
were made over two years while the switching system was undergoing
operational testing for field trials.
As summarized in Table 1, 55% of faults can
be detected with a user-defined self-checking program in the application
software, and 30% of all faults (the recoverable and negligible categories)
were either recovered or ignored by user-defined programs before the
acceptance test. Therefore, a full fault tolerance scheme was not necessary;
some self-checking programs and selective recovery mechanisms are
sufficient in this case.
| Fault category | Detectable: recoverable | Detectable: negligible | Detectable: unrecoverable | Undetectable | Total |
|---|---|---|---|---|---|
| Ratio | 23% | 7% | 25% | 45% | 100% |
Table 1. Faults detectable by self-checking programs
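The two percentages quoted in the text follow directly from the columns of Table 1. The short check below makes the arithmetic explicit; the dictionary keys and variable names are chosen here for illustration only and do not come from the measurement tooling.

```python
# Ratios from Table 1, as a percentage of all observed faults.
table1 = {
    "detectable_recoverable": 23,
    "detectable_negligible": 7,
    "detectable_unrecoverable": 25,
    "undetectable": 45,
}

# Faults a self-checking program can detect: the three "detectable" columns.
detectable = sum(v for k, v in table1.items() if k.startswith("detectable"))

# Faults recovered or ignored by user-defined programs before the acceptance test.
handled_before_at = (
    table1["detectable_recoverable"] + table1["detectable_negligible"]
)

total = sum(table1.values())
print(detectable, handled_before_at, total)  # prints: 55 30 100
```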
3. Fault-based DRB (FDRB)
When a system failure occurs, the DRB may deliver incorrect results or
no results at all. Possible causes of failure
are an undetected fault or rejection by the acceptance test of the
results provided by all the alternates. We use a self-checking program
and an exception handler to minimize undetected faults. A distributed
computing system is assumed to have the following characteristics:
- A system consists of a set of computing stations (CSs) and each CS
executes one and only one recovery block.
- A CS cannot send any messages to other computing stations or other
environments unless the execution results pass the acceptance test.
- Only the primary node can send a message as an output to the successor
CS.
- A service function of the switching software is designed and executed
by a set of CSs on a sequential basis.
- There are no shared objects that can be commonly used by multiple
CSs in the system.
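The assumptions above can be sketched as a small model. This is a minimal illustration rather than the implementation used in the switching system: the class `ComputingStation`, the function `run_service`, and the integer messages are hypothetical, the try blocks run sequentially inside one process, and the pairing of primary and shadow nodes is omitted.

```python
from typing import Callable, List, Optional

TryBlock = Callable[[int], int]
AcceptanceTest = Callable[[int, int], bool]


class ComputingStation:
    """One CS that executes exactly one recovery block (hypothetical model)."""

    def __init__(self, primary: TryBlock, alternate: TryBlock,
                 acceptance_test: AcceptanceTest) -> None:
        self.primary = primary
        self.alternate = alternate
        self.acceptance_test = acceptance_test

    def execute(self, message: int) -> Optional[int]:
        """Return the first result that passes the acceptance test, else None."""
        for try_block in (self.primary, self.alternate):
            try:
                result = try_block(message)
            except Exception:
                continue  # a crashed try block counts as a failed attempt
            if self.acceptance_test(message, result):
                return result  # only accepted results leave the CS
        return None  # every try block was rejected: the CS sends nothing


def run_service(stations: List[ComputingStation], message: int) -> Optional[int]:
    """Pass a message through the CSs one after another (sequential service)."""
    current: Optional[int] = message
    for cs in stations:
        current = cs.execute(current)
        if current is None:
            return None  # no output reached the successor CS
    return current


def increment(x: int) -> int:
    return x + 1


def faulty_increment(x: int) -> int:
    return x - 1  # injected fault, for illustration only


def accepts_increment(x: int, y: int) -> bool:
    return y == x + 1


healthy = ComputingStation(increment, increment, accepts_increment)
faulty = ComputingStation(faulty_increment, increment, accepts_increment)
dead = ComputingStation(faulty_increment, faulty_increment, accepts_increment)

print(run_service([healthy, faulty], 0))  # prints: 2
print(run_service([healthy, dead], 0))    # prints: None
```

The stations share no objects; the only coupling between them is the message handed to the successor, which mirrors the last assumption in the list.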
Figure 1 illustrates the basic structure of
a computing station. We have added a self-checking (SC) program and a
recovery-handling (RH) program to the conventional construct, which consists
of a primary block, an alternate block, and an acceptance test (AT) program.
The self-checking program is added to dynamically analyze the operational
behavior of the block. It finds faults left undetected by the DRB and extends
the fault detection coverage using application experience. With the
recovery-handling program, a programmer can assess the information on the
faults to determine the best way to recover from them. It includes routines
for recovery and exception handling. Anticipating certain types of faults in
the operation of services, a programmer can prepare exception handling
programs that reinitialize the faulty part of a block or recover from the fault.
Figure 1. The basic structure of FDRB
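How the five components might cooperate within one computing station can be sketched as follows. This is a hedged illustration under stated assumptions, not the CHILL implementation evaluated later: the function `run_fdrb_block`, the `Fault` labels, the state fields, and the choice to run the self-check before the primary block are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

State = Dict[str, int]


@dataclass
class Fault:
    kind: str  # hypothetical label, e.g. "data_out_of_range"


def run_fdrb_block(
    state: State,
    message: str,
    primary: Callable[[State, str], str],
    alternate: Callable[[State, str], str],
    acceptance_test: Callable[[str, str], bool],
    self_check: Callable[[State], Optional[Fault]],
    handlers: Dict[str, Callable[[State], bool]],
) -> Tuple[Optional[str], str]:
    """Return (result, path); path names the mechanism that produced the result."""
    path = "primary"
    fault = self_check(state)  # SC: inspect the block's operational state
    if fault is not None:
        handler = handlers.get(fault.kind)  # RH: select recovery by fault type
        if handler is not None and handler(state):
            path = "recovered"  # faulty part was reinitialized locally
        else:
            path = "alternate"  # no selective recovery is available
    if path != "alternate":
        result = primary(state, message)
        if acceptance_test(message, result):
            return result, path
    result = alternate(state, message)  # conventional DRB fallback
    if acceptance_test(message, result):
        return result, "alternate"
    return None, "failure"


def check_state(state: State) -> Optional[Fault]:
    if state["active_calls"] < 0:
        return Fault("data_out_of_range")
    if state["route"] not in (0, 1):
        return Fault("pointer_error")
    return None


def reset_call_count(state: State) -> bool:
    state["active_calls"] = 0  # reinitialize only the faulty part
    return True


def connect(state: State, message: str) -> str:
    state["active_calls"] += 1
    return message.upper()


def connect_alternate(state: State, message: str) -> str:
    return message.upper()


def accept(message: str, result: str) -> bool:
    return result == message.upper()


handlers = {"data_out_of_range": reset_call_count}


def demo(state: State) -> Tuple[Optional[str], str]:
    return run_fdrb_block(state, "offhook", connect, connect_alternate,
                          accept, check_state, handlers)


print(demo({"active_calls": 0, "route": 0}))   # ('OFFHOOK', 'primary')
print(demo({"active_calls": -5, "route": 0}))  # ('OFFHOOK', 'recovered')
print(demo({"active_calls": 0, "route": 7}))   # ('OFFHOOK', 'alternate')
```

The point of the sketch is the selective path: a fault with a prepared handler is repaired in place and the primary block still runs, while a fault without a handler falls back to the alternate as in the ordinary DRB.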
4. Evaluation of FDRB
To evaluate our approach,
we selected a simulation program of a PABX system. Using design diversity,
we implemented our approach and made four versions of the target program
using ITU-T CHILL [ITU97]. We divided the simulation
program into four blocks, each of which runs on one of the computing stations.
We selected four types of faults for experimental evaluation of the above
algorithms and injected them into the second block of the extended simulation
programs one by one. The causes and effects of the selected faults
differ, as shown in Table 2.
| No. | Types of faults | Categories of faults | Effects of faults |
|---|---|---|---|
| 1 | Pointer error | Incorrect computation | in a CS |
| 2 | Data inconsistency | Data fault | in a process |
| 3 | Illegal input signal | Unexpected situation | in a CS |
| 4 | Data out-of-range | Missing operation | in a process |
Table 2. Four types of injected faults
Table 3 shows the execution time of
each version. SCP (Self-Checking Programming) is easily implemented but
cannot recover from faults of types 1, 3 and 4.
| Versions | Without faults | With faults: Type 1 | With faults: Type 2 | With faults: Type 3 | With faults: Type 4 | Recoverability |
|---|---|---|---|---|---|---|
| Base Program | 15.6 | - | - | - | - | No |
| SCP | 18.5 | - | 18.7 | - | - | Partially Yes |
| DRB | 17.4 | 35.2 | 34.5 | 34.2 | *42.3 | Yes |
| FDRB | 21.3 | 24.3 | 22.1 | 22.9 | 24.6 | Yes |
Table 3. Execution time of four versions (Time: msec)
The FDRB needs about 20% more execution time than
the DRB in the fault-free case (21.3 msec versus 17.4 msec) but less
recovery time than the DRB when faults occur. The DRB
version has varying recovery times depending on the types and effects
of the faults. Expert programmers can greatly reduce the recovery time
by preparing the self-checking and recovery-handling programs. Our
experience shows that all faults that yield CS or system failures should
be detected as early as possible. We also find that some OS characteristics
and support functions are key factors affecting the recovery time of any
kind of fault. These functions are precise timer handling, switch-over
between two nodes, signal sending, context switching, process start/stop,
and exception handling. We have experience in compensating for the related
OS functions and obtaining meaningful results with a well-optimized program
structure.
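The comparison between the DRB and the FDRB can be reproduced from the Table 3 figures. The sketch below assumes that the extra time a run takes with an injected fault, relative to the fault-free run of the same version, approximates its recovery time; that interpretation and the variable names are assumptions made here. The DRB value for fault type 4 carries an asterisk in Table 3 that the text does not explain, and it is used as printed.

```python
# Execution times from Table 3, in msec: fault-free run, then fault types 1 to 4.
times = {
    "DRB": {"no_fault": 17.4, "faults": [35.2, 34.5, 34.2, 42.3]},
    "FDRB": {"no_fault": 21.3, "faults": [24.3, 22.1, 22.9, 24.6]},
}


def mean(values):
    return sum(values) / len(values)


# Fault-free overhead of the FDRB relative to the DRB.
overhead = (
    times["FDRB"]["no_fault"] - times["DRB"]["no_fault"]
) / times["DRB"]["no_fault"]

# Assumed proxy for recovery time: mean faulty run minus the fault-free run.
extra = {
    name: mean(t["faults"]) - t["no_fault"] for name, t in times.items()
}

print(f"FDRB fault-free overhead over DRB: {overhead:.1%}")
for name, value in extra.items():
    print(f"{name}: mean extra time with a fault = {value:.2f} msec")
```

Under this assumption the fault-free overhead comes out at roughly 22%, and the mean extra time per fault is about 19 msec for the DRB against about 2 msec for the FDRB.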
5. Conclusions
The most effective method of achieving fault-tolerant software depends
on the failure impact, the type of software, and the application. Fault
measurements on a prototype system show that large switching systems need
a variety of software fault-tolerance techniques.
Therefore, we propose a new fault-tolerant technique that is more useful
for large real-time systems, especially switching systems. The Fault-based
DRB is a hybrid technique combining the DRB and self-checking programming.
Our experimental evaluation of the FDRB shows that the average
recovery time from a fault is reduced without sacrificing the software
reliability of the system, and that the failure-free probability is higher
than with the ordinary DRB. The self-checking programs and the
recovery-handling routines in our approach introduce some overhead, but
it is relatively small and can be ignored, as shown in our experimental
evaluation.
References
[Avi86] A. Avizienis and J. Kelly, "Dependable Computing: From Concepts to Design Diversity," Proc. IEEE, Vol. 74, No. 5, pp. 629-638, May 1986.
[Kim88] K. H. Kim and J. C. Yoon, "Approaches to Implementation of a Repairable Distributed Recovery Block Scheme," Proceedings of FTCS-18, pp. 50-55, Jun. 1988.
[Kim89] K. H. Kim and H. O. Welch, "Distributed Execution of Recovery Blocks: An Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications," IEEE Trans. on Computers, Vol. 38, No. 5, pp. 626-636, May 1989.
[ITU97] ITU-T, "CCITT High Level Language (CHILL)," ITU-T Recommendation Z.200, 1997.
Author contact: Switching and Transmission Lab., ETRI, Kajong-Dong
161, Yusong-Gu, Taejon, 300-350, Korea, Phone: +82 42 860 6313, Fax: +82
42 860 5410, E-mail: bslee@etri.re.kr