A Software-Implemented Fault Injector for Real-Time Systems

João Carlos Cunha* and João Gabriel Silva**
* Instituto Superior de Engenharia de Coimbra
** Departamento de Engenharia Informática da Universidade de Coimbra
Coimbra, Portugal
jcunha@isec.pt, jgabriel@dei.uc.pt

 

A real-time system is, by definition, a system that must satisfy explicit response-time constraints or risk severe consequences, including failure [Laplante 93]. Since most mission-critical and human-critical systems operate in real time, they also require a high degree of dependability. It can therefore be argued that two of the most important characteristics of real-time systems are the ability to meet strict timing requirements and the ability to guard against imperfect execution environments.

Fault tolerance emerges naturally as a key feature of real-time systems. It can be defined as the ability of a real-time system to deliver the expected service, in a timely manner, even in the presence of faults [Jahanian 94]. This definition makes time a central concern, unlike the definition for non-real-time systems, where time is only a performance matter.

Among the numerous approaches proposed to evaluate system dependability, software-implemented fault injection (SWIFI) has been widely used, since it is more flexible than hardware-implemented fault injection. It consists essentially of executing software that corrupts the system state at certain points, either by including injection code in the normal execution path or by means of special exceptions or interrupts.

However, developing a fault injector for real-time systems is a considerable challenge, because timing requirements raise a major problem: injecting a fault, besides corrupting the state of the targeted component, may also alter the behavior of the real-time process through the delay it induces (the probe effect). Furthermore, temporal intrusion in one process may affect the whole system.

This may not be a problem for most hardware fault injectors, since most of them operate independently of the target system's execution (for example, heavy-ion radiation or power supply disturbances) and thus introduce no temporal overhead. For software fault injection, however, it is a problem, since time interference is inevitable.

In a real-time system, predictability matters more than low response time or high throughput, because it allows the designer to dimension the system according to the intended deadlines. We therefore state that the basic principle of a fault injector for real-time systems is:

  • Time interference due to fault injection must be predictable.

According to this principle we can outline two main requirements for a software-implemented fault injector for real-time systems:

  1. The time spent on fault injection and data collection should be reduced to a minimum and be quantifiable. It is then possible to set a time limit for the execution of the fault injection code that does not cause the processes to miss their deadlines.
  2. Experiment generation and control, and result analysis, should be done off-line. These steps usually take a large amount of time, and may be executed earlier, later, or on a different processor than the one executing the target application. The only steps of the experiment that must run concurrently with the benchmark are fault injection and data collection.
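
The first requirement can be read as a budget check. The following is a minimal sketch, assuming each task is described only by a worst-case execution time and a relative deadline, and that the injection overhead simply adds to the execution time. The `Task` type and both function names are hypothetical and do not come from the tool itself. Interference from other tasks is deliberately ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical task description; all times are in microseconds."""
    name: str
    wcet_us: int       # worst-case execution time without any injection code
    deadline_us: int   # relative deadline


def injection_budget_us(task: Task) -> int:
    """Slack left for fault injection and data collection code."""
    return task.deadline_us - task.wcet_us


def fits_deadline(task: Task, injection_bound_us: int) -> bool:
    """True if the bounded injection overhead alone cannot cause a deadline miss."""
    return injection_bound_us <= injection_budget_us(task)


control_loop = Task("control_loop", wcet_us=700, deadline_us=1000)
print(injection_budget_us(control_loop))   # 300
print(fits_deadline(control_loop, 50))     # True
print(fits_deadline(control_loop, 400))    # False
```

A full analysis would also have to account for preemption and for the effect of the intrusion on other tasks, which this sketch leaves out.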

We are currently developing a software-implemented fault injector that targets applications developed for a commercial off-the-shelf (COTS) real-time operating system, SMX, from Micro Digital, running on a standard PC with one Pentium processor. The fault injection core is based on Xception [Carreira 98], a SWIFI tool that uses the debugging and performance monitoring features of the processor to trigger faults and collect results.

In line with these requirements, we are designing our fault injector according to the following principles:

  • Experiment generation and control, and the storage and analysis of results, are done on a host computer, separate from the target system. The target accommodates the benchmarks, the fault injector and the data collection modules.
  • Communication between the host and the target takes place only before and after each experiment.
  • The work performed by the fault injector module in the target system is done, as far as possible, during system initialization (memory allocation, variable initialization, etc.) and is reduced to a minimum during application execution.
  • The time spent by the injection code during benchmark execution is bounded in advance. If the code actually takes longer than the specified bound, the experiment is aborted.
  • Collected data is temporarily stored in memory and sent to the host only when the experiment has finished. Since the data in memory could be corrupted, or the computer could hang or crash and thus make it impossible to retrieve its contents, we can optionally use a card with stable memory. However, with this option our system is no longer standard.
  • It is possible to inject a fault at any level of the target application or of the operating system kernel.
  • The fault model includes faults in memory, in processor functional units and buses, and also time delays. Injecting a delay is useful, for example, to force a task to miss its deadline.
  • Faults are triggered by a time delay, by an instruction or data access, or by external events. The external trigger lets us synchronize a fault injection with a controlled system.
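
Two of these principles, the bounded injection time and the memory fault model, can be illustrated together. The following is a minimal sketch, assuming a memory fault is a single bit flip in a byte array and that the time bound is checked after the injection has run. All names (`flip_bit`, `inject_with_bound`, `ExperimentAborted`) are hypothetical. A real target would read a hardware timer instead of Python's `time.perf_counter_ns`, which stands in for it here.

```python
import time


class ExperimentAborted(Exception):
    """Raised when the injection code exceeds its time bound."""


def flip_bit(memory: bytearray, address: int, bit: int) -> None:
    """Model a memory fault as a single bit flip at the given byte address."""
    memory[address] ^= 1 << bit


def inject_with_bound(memory, address, bit, bound_ns, clock=time.perf_counter_ns):
    """Inject the fault, measure the time spent, and abort if the bound is exceeded.

    The clock is a parameter so that a deterministic clock can be substituted.
    Returns the elapsed time in nanoseconds.
    """
    start = clock()
    flip_bit(memory, address, bit)
    elapsed = clock() - start
    if elapsed > bound_ns:
        # The fault has already been applied; the experiment is discarded.
        raise ExperimentAborted(f"injection took {elapsed} ns, bound is {bound_ns} ns")
    return elapsed


memory = bytearray(4)
inject_with_bound(memory, address=2, bit=5, bound_ns=1_000_000_000)
print(memory[2])   # 32
```

Note that the check happens after the fact: an injection that overruns its bound has still corrupted the state, so the only safe reaction is to discard that experiment.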

The figure sketches the main modules of the fault injector. The host system interacts with the user and controls all the experiments, while the target system executes the benchmarks and undergoes the fault injections.

The host system comprises three main modules:

  • Fault generator. This module acts in a first stage: it asks the user to define the fault parameters, then automatically generates a set of faults and stores it on disk.
  • Experiment control. Faults are read from disk one at a time and sent to the target system for injection. After each experiment has finished, the results are collected and stored on disk. The host and target systems communicate over a TCP/IP network. If stable memory is available, the results are collected from the target only after resetting it, so even if the system crashes it is almost always possible to retrieve some data.
  • Data analysis. This module is used after the whole set of experiments has finished. It reads the results of the fault injections and analyses them.
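
The division of work between the fault generator and experiment control can be sketched as follows. This is a minimal sketch under several assumptions that are not taken from the tool: faults are stored one per line as JSON, a fault is described by a trigger delay plus a memory address and bit, and the exchange with the target is hidden behind a caller-supplied function `run_on_target` instead of an actual TCP/IP connection. All names and fields are hypothetical.

```python
import json
import random
import tempfile
from pathlib import Path
from typing import Callable, Iterator


def generate_faults(path: Path, count: int, mem_size: int, seed: int = 0) -> None:
    """Fault generator: write `count` random fault descriptors to disk."""
    rng = random.Random(seed)
    with path.open("w") as f:
        for fault_id in range(count):
            fault = {
                "id": fault_id,
                "trigger_delay_us": rng.randrange(1, 10_000),  # time trigger
                "address": rng.randrange(mem_size),            # byte to corrupt
                "bit": rng.randrange(8),                       # bit to flip
            }
            f.write(json.dumps(fault) + "\n")


def read_faults(path: Path) -> Iterator[dict]:
    """Yield the stored faults one at a time."""
    with path.open() as f:
        for line in f:
            yield json.loads(line)


def run_campaign(faults_path: Path, results_path: Path,
                 run_on_target: Callable[[dict], dict]) -> int:
    """Experiment control: one fault per experiment, results stored on disk."""
    done = 0
    with results_path.open("w") as out:
        for fault in read_faults(faults_path):
            result = run_on_target(fault)  # send fault, wait, collect results
            out.write(json.dumps({"fault": fault, "result": result}) + "\n")
            done += 1
    return done


def fake_target(fault: dict) -> dict:
    """Stand-in for the real target: pretends every fault was injected."""
    return {"injected": True, "fault_id": fault["id"]}


with tempfile.TemporaryDirectory() as tmp:
    faults_file = Path(tmp) / "faults.jsonl"
    results_file = Path(tmp) / "results.jsonl"
    generate_faults(faults_file, count=3, mem_size=1024, seed=42)
    print(run_campaign(faults_file, results_file, fake_target))   # 3
```

Because generation and analysis only touch files on the host, they can run at any time without affecting the timing of the target.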

The target system runs the real-time application and is subjected to the fault injections. Three modules are added for this purpose:

  • Injection setup. This module receives one fault from the host and performs all the setup and initialization needed to prepare the actual fault injection as far as possible. The benchmark starts executing only after this setup has finished. When the experiment is finished, this module also sends the collected results to the host.
  • Fault injector and data collector. This dual-function module is invoked by an exception or interrupt in order to corrupt the state of the system. While injecting the fault, it also captures the system context and passes it to the data collection module. It runs at kernel level, so it has full access to the whole system and can corrupt any part of it.
  • Data collection. This module stores the machine context before and after the fault injection and, on request, verifies its integrity. The data may be stored in RAM or in stable memory, if available.
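
The integrity check performed by the data collection module could take several forms. The following is a minimal sketch, assuming each stored context is an opaque byte string protected by a CRC-32 checksum computed when it is stored. The use of CRC-32, the `ContextRecord` layout and the `DataCollector` class are all assumptions made for illustration and are not stated in the design above.

```python
import zlib
from dataclasses import dataclass


@dataclass
class ContextRecord:
    label: str       # for example "before" or "after" the injection
    payload: bytes   # serialized machine context (format is hypothetical)
    crc: int         # CRC-32 of the payload at the time it was stored


class DataCollector:
    """Keeps context snapshots in memory and checks them on request."""

    def __init__(self) -> None:
        self.records = []

    def store(self, label: str, payload: bytes) -> None:
        """Store a snapshot together with its checksum."""
        self.records.append(ContextRecord(label, payload, zlib.crc32(payload)))

    def verify(self) -> bool:
        """True if every stored payload still matches its checksum."""
        return all(zlib.crc32(r.payload) == r.crc for r in self.records)


collector = DataCollector()
collector.store("before", bytes([0x00, 0x10, 0x20]))
collector.store("after", bytes([0x00, 0x11, 0x20]))
print(collector.verify())   # True

# Simulate corruption of the stored data by the injected fault itself.
collector.records[0].payload = bytes([0xFF, 0x10, 0x20])
print(collector.verify())   # False
```

A check of this kind only detects corruption of the collected data. Recovering the data after a crash is what the optional stable memory is for.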

 


References

[Carreira 98] J. Carreira, H. Madeira, J. Silva, "Xception: A Technique for the Evaluation of Dependability in Modern Computers", IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 125-136, February 1998.

[Jahanian 94] F. Jahanian, "Fault-Tolerance in Embedded Real-Time Systems", Lecture Notes in Computer Science 774, Hardware and Software Architectures for Fault Tolerance, pp. 237-249, Springer-Verlag, 1994.

[Laplante 93] P. Laplante, "Real-Time Systems Design and Analysis – an Engineer’s Handbook", IEEE Computer Society Press, 1993.