Complex Failure Models for
Dependability Assessment

Péter Gáspár +
Computer and Automation Research Inst.
Hungarian Academy of Sciences

Géza Szabó *
Dept. of Control and Transport Automation
Technical University of Budapest

Fault-tolerant computer architectures play an increasingly important role in industrial control systems because of their fast response time, large computing capacity and fault detection ability [1]. Fault detection is important in such systems because a faulty component sometimes cannot be repaired immediately (e.g. sensors in a process field), so the function of the system has to be re-configured according to the detected failures. Safety-related systems are often subject to a validation process and have to reach a predefined availability level set by regulatory authorities. Such regulations often formulate the requirement as a probability of malfunction; consequently, probabilistic safety assessment techniques have to be applied to demonstrate that the required unavailability level is met.

When detection mechanisms are used in a re-configurable system, traditional assessment tools such as fault tree analysis cannot be applied, because an event describing a fault has not two possible states (the fault has occurred or not) but three: the fault has occurred and been detected, the fault has occurred and not been detected, and the fault has not occurred. These three states must be distinguished at each level of the system; they can be handled by multi-state analysis tools or by Markov chains [2].
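As a minimal sketch (Python; the class and method names are hypothetical and not taken from any particular analysis tool), the three states of a basic event can be represented explicitly. The `two_state_view` method shows what a classical two-state fault tree retains: only the sum of the two failure probabilities, so the split between detected and undetected failures is lost.

```python
from dataclasses import dataclass
from enum import Enum


class FaultState(Enum):
    """The three states of a basic event in a system with fault detection."""
    NO_FAULT = "fault has not occurred"
    DETECTED = "fault has occurred and been detected"
    UNDETECTED = "fault has occurred and not been detected"


@dataclass(frozen=True)
class ThreeStateEvent:
    """State probabilities of one basic event at a given time (hypothetical helper)."""
    p_detected: float
    p_undetected: float

    def __post_init__(self) -> None:
        # The two failure states are mutually exclusive, so together they cannot exceed 1.
        if min(self.p_detected, self.p_undetected) < 0 or self.p_detected + self.p_undetected > 1:
            raise ValueError("invalid state probabilities")

    def probability(self, state: FaultState) -> float:
        if state is FaultState.DETECTED:
            return self.p_detected
        if state is FaultState.UNDETECTED:
            return self.p_undetected
        # The remaining probability mass belongs to the fault-free state.
        return 1.0 - self.p_detected - self.p_undetected

    def two_state_view(self) -> float:
        """What a classical two-state fault tree sees: 'failed' versus 'not failed'."""
        return self.p_detected + self.p_undetected
```

For example, `ThreeStateEvent(0.03, 0.01)` and `ThreeStateEvent(0.01, 0.03)` have the same two-state view (about 0.04), although the undetected part, which usually matters most for safety, differs by a factor of three.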

In traditional probabilistic safety analysis techniques, which do not use Markov chains, certain models are used to describe the time-dependent behavior of failures. These models are important because without them the unavailability of the system is overestimated: the predefined limit may appear to be exceeded, or redundancy may seem necessary where it is not. Such models exist for the traditional methods and are supported by the existing tools.
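To illustrate the size of the overestimation, the sketch below compares, for a single component, the failure probability accumulated over a whole period when no time-dependent model is used with the time-averaged unavailability of the same component under a simple periodically-tested model. It assumes a constant failure rate, tests that always reveal a failure, and instantaneous repair; the function names and the numerical values in the usage note are illustrative only.

```python
import math


def failure_probability_untested(lam: float, t: float) -> float:
    """Probability that the component has failed by time t when failures are
    never revealed or repaired (no time-dependent test model)."""
    return 1.0 - math.exp(-lam * t)


def mean_unavailability_periodic(lam: float, test_period: float) -> float:
    """Time-averaged unavailability over one test period, assuming every test
    reveals a failure and repair is instantaneous, so each period starts
    with a working component."""
    x = lam * test_period
    if x <= 0:
        raise ValueError("lam and test_period must be positive")
    # Average of 1 - exp(-lam*t) over [0, T]; approximately lam*T/2 for small lam*T.
    return 1.0 - (1.0 - math.exp(-x)) / x
```

With an assumed failure rate of 1e-5 per hour, a one-year period (8760 h) and a monthly test (720 h), the first function gives roughly 0.084 and the second roughly 0.0036, i.e. about λT/2.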

For the analysis of systems modeled as three-state systems, at least two models have to be used for each primary failure event (called a basic event in some analyses): one for the probability of a detected failure versus time and one for the probability of an undetected failure versus time. Traditional models can only be used if one of the two required models is identically zero, because the detected and undetected failures are not independent. Complex models describing the probabilities of detected and undetected failures therefore have to be developed for three-state systems. These complex models are based on the traditional ones.
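The dependence can be made explicit with a short argument. As a sketch, let D and U denote the states "failure occurred and detected" and "failure occurred and not detected" of one component at time t (symbols introduced here only for illustration). At any time the component is in exactly one of the three states, so D and U are mutually exclusive:

```latex
\begin{aligned}
\text{mutual exclusion:}\quad & P(D \cap U) = 0,\\
\text{independence:}\quad & P(D \cap U) = P_{det}(t)\,P_{undet}(t),\\
\text{both hold only if}\quad & P_{det}(t) = 0 \ \text{or}\ P_{undet}(t) = 0.
\end{aligned}
```

This is why a model that treats the two probabilities as separate, independent basic events is only applicable when one of them is identically zero.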

In this context, the term traditional models refers to three widely used types of model. The first is the computer coverage model, which can also be used for peripheral components with immediate failure detection and repair. The second is the periodically-tested, repairable component model, which is especially important in industry: many components cannot be tested efficiently by the system itself because an additional test signal has to be applied (e.g. automatic testing is not allowed for safety reasons). The third is the simplest one, the non-repairable component model, which is used for mission-critical applications.

Complex failure models

Since the probabilities of detected and undetected failures are distinguished in the following paragraphs, a conditional probability k has to be defined: k is the probability that a fault is detectable by the tests, given that it has occurred. This probability depends on the efficiency of the detection methods applied and can be increased by implementing more complex test cases. In the following definitions, times to failure are assumed to be exponentially distributed.

  • Continuously monitored, non-repairable components. This model describes a component that cannot be repaired during the observed time interval, e.g. during one fuel cycle of a plant, but some of whose failures can be detected immediately after they occur.

   $P_{det}(t) = k\,(1 - e^{-\lambda t})$  (1)

   $P_{undet}(t) = (1 - k)\,(1 - e^{-\lambda t})$  (2)

    where $\lambda$ is the failure rate.

  • Continuously monitored, repairable components. This model can be derived from the previous one: the component is repaired promptly after a failure has been detected.
  • Periodically-tested, non-repairable components. This model may be useful for peripheral units of computer-based systems, e.g. data acquisition subsystems, that cannot be repaired because of the way they are installed and whose required tests take too long to execute.

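   $P_{det}(t) = k\,(1 - e^{-\lambda (m-1) T})$  (3)

   $P_{undet}(t) = 1 - e^{-\lambda t} - k\,(1 - e^{-\lambda (m-1) T})$  (4)

where $(m-1)T < t \le mT$, $T$ is the testing period and $m$ is a positive integer (the index of the current test interval). These components are illustrated by an illustrative example in Figure 1.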

Figure 1: Probability of failure of a periodically-tested, non-repairable component
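Equations (1) to (4) can be evaluated directly. The following Python sketch does so; the function names are hypothetical, λ is passed as `lam` and the testing period T as `test_period`, and m is computed as the index of the test interval containing t, following the condition (m-1)T < t ≤ mT.

```python
from __future__ import annotations

import math


def continuous_nonrepairable(lam: float, k: float, t: float) -> tuple[float, float]:
    """Equations (1) and (2): (P_det, P_undet) of a continuously monitored,
    non-repairable component."""
    f = 1.0 - math.exp(-lam * t)  # probability that a failure has occurred by t
    return k * f, (1.0 - k) * f


def periodic_nonrepairable(lam: float, k: float, t: float,
                           test_period: float) -> tuple[float, float]:
    """Equations (3) and (4): (P_det, P_undet) of a periodically tested,
    non-repairable component."""
    if t <= 0:
        return 0.0, 0.0
    # Index of the test interval containing t: (m-1)T < t <= mT.
    m = math.ceil(t / test_period)
    # A failure counts as detected only if it is detectable (probability k)
    # and occurred before the last test at (m-1)T.
    p_det = k * (1.0 - math.exp(-lam * (m - 1) * test_period))
    # Equation (4): every failure up to t that is not in the detected part.
    p_undet = 1.0 - math.exp(-lam * t) - p_det
    return p_det, p_undet
```

In the periodic model the detected part stays at zero during the first test interval and grows in steps at each test instant, while the sum of the two parts always equals 1 - e^{-λt}, exactly as in the continuously monitored case.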

Periodically-tested, repairable components. These components are tested periodically, but repairs are performed at predefined intervals, independently of the test interval. The repair time is assumed to be zero. The equations for the detected and undetected parts are the same as (3) and (4), with t replaced by t0 as follows:

   $t_0 = t - n T_r$  (5)

where $0 \le t_0 < T_r$, $T_r$ is the repair interval and $n$ is an integer. The undetected part is described by the following equation:

   $P_{undet}(t) = 1 - e^{-\lambda t_0} - k\,(1 - e^{-\lambda (m-1) T})$  (6)

These components are illustrated in Figure 2.

Figure 2: Probability of failure of a periodically-tested, repairable component
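A corresponding sketch for the periodically-tested, repairable model combines equation (5) with equations (3) and (6). It assumes, as one possible reading of the text, that t0 also replaces t in the condition defining m, i.e. (m-1)T < t0 ≤ mT, and it takes m = 1 at t0 = 0 so that a freshly repaired component has zero failure probability. The function and parameter names are hypothetical.

```python
from __future__ import annotations

import math


def periodic_repairable(lam: float, k: float, t: float, test_period: float,
                        repair_interval: float) -> tuple[float, float]:
    """(P_det, P_undet) of a periodically tested component that is repaired
    in zero time every repair_interval, independently of the tests."""
    # Equation (5): t0 = t - n*Tr with 0 <= t0 < Tr, i.e. n = floor(t / Tr).
    n = math.floor(t / repair_interval)
    t0 = t - n * repair_interval
    # Assumption: m is the test-interval index of t0, (m-1)T < t0 <= mT,
    # with m = 1 at t0 = 0 (freshly repaired component).
    m = max(1, math.ceil(t0 / test_period))
    # Equation (3) with t replaced by t0 (through m).
    p_det = k * (1.0 - math.exp(-lam * (m - 1) * test_period))
    # Equation (6).
    p_undet = 1.0 - math.exp(-lam * t0) - p_det
    return p_det, p_undet
```

Under this reading both curves restart from zero at every repair instant nT_r and, between repairs, follow the non-repairable model of equations (3) and (4).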

Summary

In this abstract, complex time-dependent failure models have been introduced in order to determine the dependability of safety systems more precisely. The models have been applied to the analysis of a computer-based protection system of a power plant with respect to the criteria defined by the regulatory authority. As the research project has not yet been finished, further complex models have to be defined, taking other application areas into account.

References

[1] Steininger, A., Scherrer, C., "On Finding an Optimal Combination of Error Detection Mechanisms Based on Results of Fault Injection Experiments", Proc. FTCS-27, Seattle, Washington, USA, 1997, pp. 238-247.

[2] Kai, Y., "Multistate fault-tree analysis", Reliability Engineering and System Safety, Elsevier Science Publishers Ltd, Vol. 28, 1990, pp. 1-7.


Author contact:
+: Computer and Automation Research Institute, Hungarian Academy of Sciences; Kende u. 13-17, Budapest, Hungary, H-1111.
Phone: (+36-1) 166-7483, Fax: (+36-1) 166-7503, E-mail: gaspar@sztaki.hu

*: Department of Transport Automation, Technical University of Budapest; Bertalan u. 2, Budapest, Hungary, H-1111.
Phone: (+36-1) 463-1979, Fax: (+36-1) 463-3087, E-mail: szabo-g@kaut.kka.bme.hu