Fault-based DRB (Distributed Recovery Block) for Switching Systems

Byung Sun Lee1
Electronics and Telecommunications Research Institute
Taejon, Korea
E-mail: bslee@etri.re.kr
 
Yong Rae Kwon
Korea Advanced Institute of Science and Technology
Taejon, Korea
E-mail: yrkwon@salmosa.kaist.ac.kr

Abstract

The Distributed Recovery Block (DRB) is a well-known fault-tolerance technique for real-time systems that treats hardware and software faults uniformly and provides a forward recovery scheme. We consider how to improve the DRB for practical use in large and complex real-time systems. We propose a new fault-tolerance technique designed specifically for large real-time systems, especially switching systems. The technique is based on measurements of the types and effects of faults, gathered from testing experience in the software development of switching systems. Our approach, the Fault-based DRB (FDRB), adds a self-checking and selective recovery mechanism to the DRB. We compared our approach with the ordinary DRB through simulations with injected faults.

1. Introduction

Design diversity is one of the software techniques based on the provision of design redundancy to provide tolerance to human mistakes made while devising software [Avi86]. The Distributed Recovery Block (DRB) was proposed to treat both hardware and software faults in a uniform manner [Kim88][Kim89]. Its relatively low run-time overhead makes it suitable for incorporation into real-time systems. It uses multiple distributed computing systems (DCSs), each executing one and only one recovery block. The smallest unit of software redundancy used is a version of the full application program running on a computing station.
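The recovery block construct that each DRB station executes can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' CHILL implementation: the names `recovery_block`, `primary`, `alternate`, and `acceptance_test` are hypothetical, the integer square root task is invented for the example, and the pairing of primary and shadow nodes across stations is deliberately left out.

```python
from typing import Callable, Optional, Sequence


def recovery_block(
    x: int,
    try_blocks: Sequence[Callable[[int], int]],
    acceptance_test: Callable[[int, int], bool],
) -> Optional[int]:
    """Run the try blocks in order and return the first accepted result."""
    for block in try_blocks:
        try:
            result = block(x)
        except Exception:
            continue  # a crash counts as a failed try block
        if acceptance_test(x, result):
            return result
    return None  # every try block was rejected: the block as a whole fails


# Hypothetical task: integer square root, with a deliberately faulty primary.
def primary(x: int) -> int:
    return x // 2  # wrong for most inputs (assumed design fault)


def alternate(x: int) -> int:
    r = 0
    while (r + 1) * (r + 1) <= x:
        r += 1
    return r


def acceptance_test(x: int, r: int) -> bool:
    return r >= 0 and r * r <= x < (r + 1) * (r + 1)
```

Calling `recovery_block(49, [primary, alternate], acceptance_test)` rejects the primary's result and falls back to the alternate, which is the design-diversity idea described above in its simplest form.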

However, several constraints limit the practical use of the DRB in large and complex systems. It does not consider the types and effects of faults, and it requires the same amount of recovery time regardless of the kind of fault that occurred in a block. Two alternatives can therefore be considered for a large real-time system. One is to use a selective recovery mechanism in a computing station, chosen according to fault type and effect; the other is to build a hierarchical software structure using the DRB.

2. Fault Measurement

We selected a prototype telephone switching system to characterize the types and effects of faults found in distributed real-time systems. Typical switching system software is very large and complex, and includes various functions for telephony hardware resources. The switching system requires high reliability, and in particular high availability, to maintain its grade of service. The measurements were made over two years while the switching system was undergoing operational testing for field trials.

As summarized in Table 1, 55% of all faults could be detected by user-defined self-checking programs in the application software, and 30% of all faults were either recovered or ignored by user-defined programs before the acceptance test. Therefore, a full fault-tolerance scheme was not necessary; some self-checking programs and selective recovery mechanisms are sufficient in this case.

Fault category   Detectable                                 Undetectable   Total
                 recoverable   negligible   unrecoverable
Ratio            23%           7%           25%             45%            100%

Table 1. Faults detectable by self-checking programs
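The two percentages quoted in the text follow directly from the table. The short check below only re-adds the published ratios; the dictionary keys are labels chosen for this sketch, not terms from the measurement study.

```python
# Ratios copied from Table 1 (percent of all observed faults).
table1 = {
    "detectable_recoverable": 23,
    "detectable_negligible": 7,
    "detectable_unrecoverable": 25,
    "undetectable": 45,
}

# Faults that a self-checking program can detect at all.
detectable = sum(v for k, v in table1.items() if k.startswith("detectable_"))

# Faults that were recovered or ignored before the acceptance test.
handled_before_test = (
    table1["detectable_recoverable"] + table1["detectable_negligible"]
)

total = sum(table1.values())
print(detectable, handled_before_test, total)
```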

3. Fault-based DRB (FDRB)

When a system failure occurs, the DRB may deliver incorrect results or no results at all. Possible causes of failure are an undetected fault or rejection by the acceptance test of the results provided by all the alternates. We use a self-checking program and an exception handler to minimize undetected faults. A distributed computing system is assumed to have the following characteristics:

  • A system consists of a set of computing stations (CSs) and each CS executes one and only one recovery block.
  • A CS cannot send any messages to other computing stations or other environments unless the execution results pass the acceptance test.
  • Only the primary node can send a message as an output to the successor CS.
  • A service function of the switching software is designed and executed by a set of CSs on a sequential basis.
  • There are no shared objects that can be commonly used by multiple CSs in the system.
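The assumptions listed above can be pictured as a chain of stations in which a result moves forward only after it passes the local acceptance test. The sketch below is one possible reading, not the authors' design: `Station` and `run_service` are hypothetical names, each station is reduced to a single function plus its test, and the primary/shadow node pair inside a station is not modelled.

```python
from typing import Callable, List, Optional, Tuple

# A station is modelled as (recovery block, acceptance test). Assumed shape.
Station = Tuple[Callable[[int], int], Callable[[int], bool]]


def run_service(stations: List[Station], message: int) -> Optional[int]:
    """Pass a message through the stations one after another.

    A station hands its result to its successor only when the result passes
    that station's acceptance test; otherwise nothing is sent at all.
    """
    for block, acceptance_test in stations:
        result = block(message)
        if not acceptance_test(result):
            return None  # output withheld: the service stops at this station
        message = result  # the message is the only data the stations share
    return message
```

Because each station receives nothing but the incoming message, the sketch also reflects the assumption that stations share no objects.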

Figure 1 illustrates the basic structure of a computing station. We have added a self-checking (SC) program and a recovery-handling (RH) program to the conventional construct, which consists of a primary block, an alternate block, and an acceptance test (AT) program. The self-checking program dynamically analyzes the operational behavior of the block. It finds faults left undetected by the DRB and extends the fault detection coverage using experience with the application. With the recovery-handling program, a programmer can assess the information on a fault to determine the best way to recover from it. It includes routines for recovery and exception handling. Anticipating certain types of faults in the operation of services, a programmer can prepare exception-handling programs that reinitialize the faulty part of a block or recover from the fault.

Figure 1. The basic structure of FDRB
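One way the five components of a station could fit together is sketched below. The control flow is an assumption made for illustration, since the text does not spell out the order in which the self-check, the recovery handler, and the acceptance test run. The names `Verdict` and `fdrb_station` are hypothetical, and the three fault verdicts borrow the detectable categories of Table 1.

```python
from enum import Enum
from typing import Callable, Optional


class Verdict(Enum):
    OK = "ok"
    RECOVERABLE = "recoverable"
    NEGLIGIBLE = "negligible"
    UNRECOVERABLE = "unrecoverable"


def fdrb_station(
    x: int,
    primary: Callable[[int], int],
    alternate: Callable[[int], int],
    self_check: Callable[[int, int], Verdict],
    recovery_handler: Callable[[int, int], int],
    acceptance_test: Callable[[int, int], bool],
) -> Optional[int]:
    """Assumed flow: primary, self-check, selective recovery, acceptance test."""
    result = primary(x)
    verdict = self_check(x, result)
    if verdict is Verdict.RECOVERABLE:
        result = recovery_handler(x, result)  # selective, local repair
    elif verdict is Verdict.UNRECOVERABLE:
        result = alternate(x)  # full fallback to the alternate block
    # OK and NEGLIGIBLE verdicts leave the primary result untouched.
    if acceptance_test(x, result):
        return result
    if verdict is not Verdict.UNRECOVERABLE:
        result = alternate(x)  # test rejected the result: try the alternate
        if acceptance_test(x, result):
            return result
    return None  # the station sends nothing to its successor
```

The point of the sketch is the selective path: a fault that the self-check classifies as recoverable is repaired locally, so the alternate block runs only when that cheaper path is unavailable or fails the acceptance test.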

4. Evaluation of FDRB

To evaluate our approach, we selected a simulation program of a PABX system. Using design diversity, we implemented our approach and made four versions of the target program in ITU-T CHILL [ITU97]. We divided the simulation program into four blocks, each of which runs on one of the computing stations. We selected four types of faults for the experimental evaluation and injected them, one at a time, into the second block of the extended simulation programs. The causes and effects of the selected faults differ, as shown in Table 2.

No.   Types of faults        Categories of faults    Effects of faults
1     Pointer error          Incorrect computation   in a CS
2     Data inconsistency     Data fault              in a process
3     Illegal input signal   Unexpected situation    in a CS
4     Data out-of-range      Missing operation       in a process

Table 2. Four types of injected faults

Table 3 shows the execution time of each version. SCP (Self-Checking Programming) is easy to implement but cannot recover from faults of types 1, 3, and 4.

Versions       Without   With faults                           Recoverability
               faults    Type 1   Type 2   Type 3   Type 4
Base Program   15.6      -        -        -        -          No
SCP            18.5      -        18.7     -        -          Partially Yes
DRB            17.4      35.2     34.5     34.2     *42.3      Yes
FDRB           21.3      24.3     22.1     22.9     24.6       Yes

Table 3. Execution time of four versions (Time: msec)

In the fault-free case the FDRB needs roughly 20% more execution time than the DRB (21.3 msec versus 17.4 msec), but it needs less recovery time than the DRB when a fault occurs. The recovery time of the DRB version varies with the types and effects of the faults. Expert programmers can substantially reduce the recovery time by preparing the self-checking and recovery-handling programs. Our experience shows that all faults that lead to CS or system failures should be detected as early as possible. We find that some OS characteristics and support functions are key factors affecting the recovery time for any kind of fault. These functions are precise timer handling, switchover between two nodes, signal sending, context switching, process start/stop, and exception handling. We have compensated for these OS functions in practice and obtained meaningful results with a well-optimized program structure.
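The comparison can be recomputed from the figures in Table 3. The snippet below only restates the published numbers; the variable names are chosen for this sketch, the unexplained asterisk on the DRB type 4 value is dropped, and averaging over the four fault types is an illustrative choice rather than a metric defined in the text.

```python
# Execution times in msec, copied from Table 3.
times = {
    "DRB": {"no_fault": 17.4, "with_fault": [35.2, 34.5, 34.2, 42.3]},
    "FDRB": {"no_fault": 21.3, "with_fault": [24.3, 22.1, 22.9, 24.6]},
}


def mean(values):
    return sum(values) / len(values)


# Relative cost of the added SC and RH programs when no fault occurs.
overhead = times["FDRB"]["no_fault"] / times["DRB"]["no_fault"] - 1

# Average execution time with one injected fault, over the four fault types.
drb_mean = mean(times["DRB"]["with_fault"])
fdrb_mean = mean(times["FDRB"]["with_fault"])

print(f"FDRB fault-free overhead relative to DRB: {overhead:.1%}")
print(f"Mean time with a fault: DRB {drb_mean:.1f} msec, FDRB {fdrb_mean:.1f} msec")
```

On these numbers the fault-free overhead is about 22%, while the mean execution time with an injected fault falls from roughly 36.5 msec for the DRB to roughly 23.5 msec for the FDRB.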

5. Conclusions

The most effective method of achieving fault-tolerant software depends on the failure impact, the type of software, and the application. Our fault measurements on a prototype system show that large switching systems need a variety of software fault-tolerance techniques.

Therefore, we propose a new fault-tolerance technique that is more useful for large real-time systems, especially switching systems. The Fault-based DRB is a hybrid technique that combines the DRB with self-checking programming. Our experimental evaluation of the FDRB shows that the average recovery time from a fault is reduced without sacrificing the software reliability of the system, and that the failure-free probability is higher than with the ordinary DRB. The self-checking programs and the recovery-handling routines in our approach introduce some overhead, but it is relatively small and can be ignored, as shown in our experimental evaluation.

References

[Avi86] A. Avizienis and J. Kelly, "Dependable Computing: From Concepts to Design Diversity," Proc. IEEE, Vol. 74, No. 5, pp. 629-638, May 1986.
[Kim88] K. H. Kim and J. C. Yoon, "Approaches to Implementation of a Repairable Distributed Recovery Block Scheme," Proc. FTCS-18, pp. 50-55, Jun. 1988.
[Kim89] K. H. Kim and H. O. Welch, "Distributed Execution of Recovery Blocks: An Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications," IEEE Trans. on Computers, Vol. 38, No. 5, pp. 626-636, May 1989.
[ITU97] ITU-T, "CCITT High Level Language (CHILL)," ITU-T Recommendation Z.200, 1997.


1. Author contact: Switching and Transmission Lab., ETRI, Kajong-Dong 161, Yusong-Gu, Taejon, 300-350, Korea, Phone: +82 42 860 6313, Fax: +82 42 860 5410, E-mail: bslee@etri.re.kr