A Software-Implemented Fault Injector for Real-Time Systems
João Carlos Cunha* and João Gabriel Silva**
* Instituto Superior de Engenharia de Coimbra
** Departamento de Engenharia Informática da Universidade
de Coimbra
Coimbra, Portugal
jcunha@isec.pt, jgabriel@dei.uc.pt
A real-time system is, by definition, a system that must satisfy explicit
response-time constraints, or risk severe consequences, including failure
[Laplante 93]. Since most of the mission-critical
and human-critical systems operate in real-time, these systems also require
a high degree of dependability. It can therefore be argued that two of the
most important characteristics of real-time systems are the capability to
fulfill strict timing requirements and to guard against imperfect execution
environments.
Fault-tolerance emerges naturally as a key feature in real-time systems,
and can be defined as the ability of a real-time system to deliver the expected
service, in a timely manner, even in the presence of faults [Jahanian
94]. This definition introduces time as a main point, unlike the definition
for non-real-time systems, where time is only a performance matter.
Among the numerous approaches proposed to evaluate system dependability,
software-implemented fault injection (SWIFI) has been widely used, since
it is more flexible than hardware-implemented fault injection. It basically
consists of executing software that corrupts the system at certain
points, either by including injection code in the normal execution path,
or by means of special exceptions or interrupts.
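As a minimal illustration of what "corrupting the system" can mean, the sketch below inverts a single bit in a byte array that stands in for the target's memory. The bit-flip fault and the names `flip_bit` and `image` are assumptions made for this example; the text above does not prescribe a specific corruption.

```python
def flip_bit(memory: bytearray, address: int, bit: int) -> int:
    """Invert one bit of the byte at `address` and return the corrupted byte."""
    if not 0 <= bit < 8:
        raise ValueError("bit must be in the range 0..7")
    memory[address] ^= 1 << bit
    return memory[address]


# A toy 4-byte image standing in for the target's memory (hypothetical data).
image = bytearray([0x00, 0x0F, 0xF0, 0xFF])
flip_bit(image, address=1, bit=7)  # byte 1 changes from 0x0F to 0x8F
```

In a real SWIFI tool the same operation would act on the target's actual memory or registers, reached through injection code or an exception handler.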
However, developing a fault injector for real-time systems is a major challenge,
since timing requirements raise a major problem: the injection of a fault,
besides corrupting the state of the targeted component, may also alter the
behavior of the real-time process, due to the induced delay (probe effect).
Furthermore, the temporal intrusion in one process may affect the whole
system.
This may not be a problem for most hardware fault injectors, since most
of them operate independently of the target system's execution (as with
heavy-ion radiation or power supply disturbances), and thus do not introduce
any temporal intrusion. For software fault injection, however, it is a
problem, since time interference is inevitable.
In a real-time system, predictability matters more than low response time
or high throughput, because it lets the system designer design the system
to meet the intended deadlines. We therefore state that the
basic principle of a fault injector for real-time systems is:
- Time interference due to fault injection must be predictable.
According to this principle we can outline two main requirements for
a software-implemented fault injector for real-time systems:
- Fault injection and data collection should take a minimal
and quantifiable time. It is then possible to define a time limit
for the execution of the fault injection code, without causing the processes
to miss their deadlines.
- Experiment generation and control, and result analysis should be done
off-line. These steps usually take a large amount of time, and may
be executed earlier, later or on a different processor from the one executing
the target application. The only steps in the experiment that must be executed
concurrently with the benchmark are the fault injection and data collection.
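The first requirement can be made concrete with a small calculation. The sketch below derives the largest per-injection time limit that still lets a task meet its deadline. It assumes a single task with a known worst-case execution time and no interference from other tasks; the function name and the numbers are hypothetical.

```python
def max_injection_budget_us(deadline_us: int, wcet_us: int, injections: int = 1) -> int:
    """Largest per-injection time limit that keeps the task within its deadline.

    Simplifying assumption: the task runs alone, so its slack is simply
    deadline minus worst-case execution time (WCET).
    """
    if injections < 1:
        raise ValueError("at least one injection is required")
    slack_us = deadline_us - wcet_us
    if slack_us < 0:
        raise ValueError("the task misses its deadline even without injection")
    return slack_us // injections


# Hypothetical task: 10 ms deadline, 7.5 ms worst-case execution time.
budget = max_injection_budget_us(deadline_us=10_000, wcet_us=7_500)
```

With these example figures the injection code may take at most 2500 microseconds. A full analysis would also have to account for preemption by other tasks.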
We are currently developing a software-implemented fault injector, targeting
applications developed for a commercial off-the-shelf (COTS) real-time operating
system, SMX, from Micro Digital, running on a standard PC with one Pentium
processor. The fault injection core is based on Xception [Carreira
98], a SWIFI tool that uses the debugging and performance monitoring
features of the processor for triggering faults and collecting results.
Conforming to the mentioned requirements, we are designing our fault
injector following the principles that:
- The experiment generation and control, and the storage and analysis
of results, are done on a host computer, different from the target system.
The target will accommodate the benchmarks, the fault injector and data
collection modules.
- Communications between the host and the target are done only before
and after each experiment.
- The work performed by the fault injector module in the target system
is done as much as possible during the system initialization (like memory
allocation, variable initializations, etc.) and reduced to the minimum
during the application execution.
- The time spent by the injection code during benchmark execution is
bounded in advance. If it actually takes longer than the specified bound,
the experiment is aborted.
- Data collected is temporarily stored in memory and sent to the host
only when the experiment has finished. Since the data in memory could be
corrupted, or the computer could hang or crash and thus make it impossible
to retrieve its contents, we can optionally use a card with stable memory.
However, with this option, our system will no longer be standard.
- It is possible to inject a fault at any level of the target application
or the operating system kernel.
- The fault model includes faults in memory, in processor functional
units and buses and also time delays. Injecting a delay is useful, for
example, to force a task to miss its deadline.
- The faults are triggered by time delay, instruction or data access
and by external events. This external trigger will let us synchronize a
fault injection with a controlled system.
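The fault model, the trigger types and the bounded injection time listed above can be summarized in a small data structure. The following is a sketch only: the class names, the field names and the after-the-fact time check are assumptions made for illustration, not the tool's actual design. On the real target, the bound would have to be enforced by a mechanism that does not itself add unpredictable delay.

```python
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Target(Enum):
    """Fault locations named in the fault model."""
    MEMORY = "memory"
    FUNCTIONAL_UNIT = "functional_unit"
    BUS = "bus"
    DELAY = "delay"


class Trigger(Enum):
    """Events that can trigger a fault."""
    TIME = "time"
    INSTRUCTION_ACCESS = "instruction_access"
    DATA_ACCESS = "data_access"
    EXTERNAL_EVENT = "external_event"


@dataclass(frozen=True)
class Fault:
    target: Target
    trigger: Trigger
    budget_us: int  # time bound for the injection code (hypothetical field)


class BudgetExceeded(Exception):
    """Raised when the injection took longer than its bound; abort the experiment."""


def inject_with_budget(
    fault: Fault,
    inject: Callable[[Fault], None],
    clock_ns: Callable[[], int] = time.perf_counter_ns,
) -> int:
    """Run `inject`, measure its duration, and abort if the bound was exceeded."""
    start = clock_ns()
    inject(fault)
    elapsed_us = (clock_ns() - start) // 1000
    if elapsed_us > fault.budget_us:
        raise BudgetExceeded(
            f"injection took {elapsed_us} us, bound was {fault.budget_us} us"
        )
    return elapsed_us


# Usage: a delay fault triggered by an external event, with a generous bound.
delay_fault = Fault(Target.DELAY, Trigger.EXTERNAL_EVENT, budget_us=1_000_000)
inject_with_budget(delay_fault, lambda fault: None)
```

The `clock_ns` parameter exists so that the check can be exercised with a deterministic clock; it is not part of the described design.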
The accompanying figure sketches the main modules of the fault injector.
The host system interacts with the user, and controls all the experiments,
while the target system executes the benchmarks and is subjected to the fault injections.
The host system comprises three main modules:
- Fault generator. This module runs in a first stage, asking the
user for the definition of fault parameters and automatically generating
and storing a set of faults on disk.
- Experiment control. One fault at a time is read from the
disk and sent to the target system to be injected. After the experiment
has finished, the results are collected and stored on disk. The host and
target systems communicate over a TCP/IP network. If a stable memory exists,
the results are collected from the target only after resetting it, so even
if the system crashes, it is almost always possible to retrieve some data.
- Data analysis. This module is used after the whole experiment
has finished. It reads the results of the fault injections and analyses
them.
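The division of work among the three host modules can be sketched as follows. This is an illustrative model, not the tool's implementation: the target is replaced by a plain function, faults are kept in memory rather than on disk, and the fault fields and outcome labels are hypothetical.

```python
import random
from collections import Counter
from typing import Callable


def generate_faults(count: int, address_range: tuple[int, int], seed: int = 0) -> list[dict]:
    """Fault generator: build a reproducible set of fault descriptions."""
    rng = random.Random(seed)
    return [
        {"id": i, "address": rng.randrange(*address_range), "bit": rng.randrange(8)}
        for i in range(count)
    ]


def run_campaign(faults: list[dict], run_on_target: Callable[[dict], str]) -> list[dict]:
    """Experiment control: one fault per experiment, results collected afterwards."""
    results = []
    for fault in faults:
        outcome = run_on_target(fault)  # stands in for send, run, collect
        results.append({"fault": fault, "outcome": outcome})
    return results


def summarize(results: list[dict]) -> dict[str, int]:
    """Data analysis: count how often each outcome occurred."""
    return dict(Counter(result["outcome"] for result in results))


def fake_target(fault: dict) -> str:
    """Stand-in for the target system; the outcome labels are made up."""
    return "deadline_miss" if fault["bit"] == 7 else "no_effect"


faults = generate_faults(5, (0, 1024), seed=1)
results = run_campaign(faults, fake_target)
summary = summarize(results)
```

Because each experiment handles exactly one fault, a crash of the target affects only the experiment in progress.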
The target system runs the real-time application, and is subjected to fault injections.
Three modules are added to perform this task:
- Injection setup. This module receives one fault from the host
and does all the needed setups and initializations in order to prepare,
as much as possible, the actual fault injection. The benchmark starts
executing only after this setup has finished. When the experiment is finished, this module
also sends the collected results to the host.
- Fault injector and data collector. This double-function module
is invoked by an exception/interrupt in order to corrupt the state of the
system. While injecting the fault, this module also gets the system context
and sends it to the data collection module. This module runs at the kernel
level, so it has full access to the whole system in order to corrupt it.
- Data collection. This module stores the machine context before
and after the fault injection and, when requested, verifies its integrity.
The storage location may be in RAM or in stable memory, if one exists.
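One way the data collection module could verify the integrity of stored contexts is to keep a checksum next to each snapshot. The sketch below uses CRC-32 from Python's standard `zlib` module. The choice of CRC-32 and the class and field names are assumptions; the text does not say how integrity is verified.

```python
import zlib
from dataclasses import dataclass


@dataclass
class ContextRecord:
    label: str      # "before" or "after" the injection
    payload: bytes  # raw snapshot of the machine context
    crc: int        # CRC-32 of the payload, computed when it was stored


class DataCollector:
    """Keeps context snapshots in memory and can check them for corruption."""

    def __init__(self) -> None:
        self.records: list[ContextRecord] = []

    def store(self, label: str, payload: bytes) -> None:
        data = bytes(payload)
        self.records.append(ContextRecord(label, data, zlib.crc32(data)))

    def verify(self) -> bool:
        """Return True only if every stored snapshot still matches its CRC."""
        return all(zlib.crc32(r.payload) == r.crc for r in self.records)


collector = DataCollector()
collector.store("before", b"\x00\x01\x02\x03")
collector.store("after", b"\x00\x81\x02\x03")
intact = collector.verify()
```

A checksum detects corruption of the stored results but cannot repair it. That is one reason stable memory remains useful when the target may crash.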
References
[Carreira 98] J. Carreira, H. Madeira, J. Silva,
"Xception: A Technique for the Evaluation of Dependability in Modern
Computers", IEEE Transactions on Software Engineering, vol. 24, no. 2,
pp. 125-136, February 1998
[Jahanian 94] F. Jahanian, "Fault-Tolerance
in Embedded Real-Time Systems", Lecture Notes in Computer Science 774,
Hardware and Software Architectures for Fault Tolerance, pp. 237-249, Springer-Verlag,
1994
[Laplante 93] P. Laplante, "Real-Time Systems
Design and Analysis: An Engineer's Handbook", IEEE Computer
Society Press, 1993