Toward Middleware Fault Injection for Automotive Networks
Philip Koopman, Eushiuan Tran, Geoff Hendrey
Electrical and Computer Engineering Department &
Institute for Complex Engineered Systems
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Abstract
As embedded communication networks pervade widely fielded safety-critical
distributed systems, it is important to understand their robustness. Middleware
fault injection offers advantages in flexibility and cost over adding specialized
fault injecting network nodes. The research goal is accelerated testing
of an automated vehicle system with respect to transient communication network
faults.
Research Summary
Embedded communication networks are playing an increasingly important
role in safety-critical systems. Networking is being used to give greater
system design flexibility, improve diagnosability, and reduce wiring weight/size/cost.
As an example, prototype vehicles are using drive-by-wire capabilities,
in which critical functions are performed entirely by networked computers.
As this shift toward digital technology takes place, the importance of control
system dependability will increase dramatically.
While techniques for constructing dependable networks have been studied
for many years, large scale applications such as automobiles have somewhat
different design concerns than traditional aerospace and military uses.
For example, cost constraints limit the redundancy that can be installed,
and the large installed base makes it likely that even improbable failure
modes will be experienced somewhere within an operating fleet on a regular
basis. To illustrate, an extremely improbable event in aviation
might be defined as one occurring at a rate of 10^-9 failures per
hour [1]. Applying this rate to the airlines tracked by
the U.S. government [2] yields one failure per
73 years, based on 13.7 million fleet operating hours in 1996. However, the
fleet of 200 million U.S. ground vehicles experiences about four orders
of magnitude more usage than airline equipment. Using U.S. government data
for vehicles [3], that same extremely improbable
aviation failure rate yields 82 failures per year (one every 4.5 days) based
on 2.469 trillion vehicle miles traveled if one assumes an estimated average
speed of 30 miles per hour. Vehicle failures caused by merely improbable,
rather than extremely improbable, events would, of course, be
more numerous.
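The fleet arithmetic above can be reproduced with a short calculation. The
sketch below uses only the figures quoted in the text (10^-9 failures per hour,
13.7 million airline fleet hours, 2.469 trillion vehicle miles, and the assumed
30 miles per hour average speed); the function and variable names are
illustrative and not part of the original work.

```python
# Figures quoted in the text; names are illustrative.
FAILURE_RATE_PER_HOUR = 1e-9            # "extremely improbable" rate [1]
AIRLINE_FLEET_HOURS_PER_YEAR = 13.7e6   # 1996 airline fleet operating hours [2]
VEHICLE_MILES_PER_YEAR = 2.469e12       # 1996 vehicle miles traveled [3]
ASSUMED_AVERAGE_SPEED_MPH = 30.0        # average speed assumed in the text


def expected_failures_per_year(fleet_hours_per_year,
                               rate_per_hour=FAILURE_RATE_PER_HOUR):
    """Expected number of failures per year across a whole fleet."""
    return fleet_hours_per_year * rate_per_hour


# Convert vehicle miles to operating hours using the assumed average speed.
vehicle_fleet_hours = VEHICLE_MILES_PER_YEAR / ASSUMED_AVERAGE_SPEED_MPH

airline_failures = expected_failures_per_year(AIRLINE_FLEET_HOURS_PER_YEAR)
vehicle_failures = expected_failures_per_year(vehicle_fleet_hours)

years_between_airline_failures = 1.0 / airline_failures
days_between_vehicle_failures = 365.0 / vehicle_failures
usage_ratio = vehicle_fleet_hours / AIRLINE_FLEET_HOURS_PER_YEAR

print(f"Airlines: one failure per {years_between_airline_failures:.0f} years")
print(f"Vehicles: {vehicle_failures:.0f} failures per year, "
      f"one every {days_between_vehicle_failures:.1f} days")
print(f"Vehicle fleet usage is about {usage_ratio:.0f} times airline usage")
```

The usage ratio comes out at roughly 6,000, which is the "about four orders of
magnitude" cited above, and the interval between vehicle failures is about 4.4
days, which the text rounds to 4.5.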
Because even extremely improbable (from a single unit point of view)
failures will occur as a matter of course within a large fleet, it is important
to understand the robustness of embedded control systems when experiencing
a large variety of anticipated and unanticipated failures. A specific area
of concern is in embedded control networks, which typically have a non-negligible
transmission error rate.
One would hope that typical product testing would find and correct system-level
problems caused by message transmission errors. The problem with relying
upon normal product testing, however, is that improbable events are unlikely
to be observed in any reasonable amount of time. Analysis and simulation
can be used to help predict worst-case behavior, but these approaches are
difficult to apply on a system-level basis because of the expense of modeling
complete electromechanical systems.
Fault injection is an alternative to modeling, and is used to probe the
behavior of distributed systems. A typical approach is to add a commercially
available fault injection node to a network. Such an injection node has
the ability to selectively corrupt messages as they appear on the network.
This approach can provide the following capabilities:
- Inject globally detected errors (all receivers detect).
- Inject undetected data bit errors (errors that get past the message CRC).
- Inject bursts of complete network failure.
- Saturate the network with irrelevant messages.
- Instrument message traffic on the network for capacity analysis.
However, there are limitations to such hardware fault injection. The
hardware fault injector can be expensive, which is an important consideration
in cost-sensitive, low-margin businesses such as the automotive industry
or university research. Additionally, hardware fault injectors are not able
to see inside the various nodes on the network, restricting
the types of errors they can inject and data they can collect.
More recently, work has been performed in software-implemented fault
injection (SWIFI) for communication networks. For example, Stott et al.
[4] modified high-speed LAN host interface board control
software to test system robustness in the face of injected faults. The automated
vehicle application that will serve as our experimental platform uses the
Controller Area Network (CAN), which is the de facto standard automotive
control network. This brings with it the benefits of having a mature protocol
definition and implementation. However, it also makes it essentially impossible
to perform fault injection within the network controller, since the controller
is encapsulated within a fixed-design chip.
Figure 1. Middleware performs fault injection on network messages.
A generalized SWIFI approach for network messages that avoids the need
to change network controller software is the use of middleware (Figure 1).
For example, Dawson et al. [5] use middleware
with a script-driven technique to test protocols. Our work also uses middleware
SWIFI, but for the purpose of testing application robustness, and with special
techniques required by the use of a low-speed control network. Middleware
fault injection has advantages beyond hardware fault injection, including
the ability to:
- Inject local reception errors (only some receivers discard a message).
Long networks or networks with locally extreme noise sources tend to experience
this problem.
- Delay message transmission to explore the effects of varying software
execution times.
- Delay control feedback and status responses.
- Instrument offered load and end-to-end latencies.
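As a rough illustration of the first two capabilities in this list, the sketch
below models a middleware layer on each side of the network controller. It is
a simulation only and uses no real CAN driver; the class names, the node names
("steering", "braking"), and message type 37 are hypothetical choices made for
the example and are not taken from the implementation described here.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass(frozen=True)
class Message:
    msg_type: int       # application-level message type
    data: bytes         # payload handed to the network controller
    send_time: float    # simulated transmission time, in seconds


@dataclass
class TransmitMiddleware:
    """Sits between the sending application and the network controller."""
    delay_s: float = 0.0   # injected transmission delay

    def send(self, msg: Message) -> Message:
        # Model a slow software path by pushing the send time back.
        return Message(msg.msg_type, msg.data, msg.send_time + self.delay_s)


@dataclass
class ReceiveMiddleware:
    """Sits between the network controller and the receiving application."""
    node_name: str
    drop_types: Set[int] = field(default_factory=set)

    def deliver(self, msg: Message) -> Optional[Message]:
        # A local reception error: this node discards the message even
        # though other receivers accept it.
        if msg.msg_type in self.drop_types:
            return None
        return msg


tx = TransmitMiddleware(delay_s=0.005)
steering = ReceiveMiddleware("steering", drop_types={37})
braking = ReceiveMiddleware("braking")

on_wire = tx.send(Message(msg_type=37, data=b"\x01\x02", send_time=1.0))
steering_view = steering.deliver(on_wire)   # None: injected local error
braking_view = braking.deliver(on_wire)     # message arrives, 5 ms late
```

Because the fault is applied per receiver, one node can be made to miss a
message that every other node accepts, which a fault injector attached to the
shared bus cannot easily reproduce.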
Because low-speed control networks can have sensitive timing constraints,
our SWIFI approach must find a way to coordinate system-wide fault injection
without interfering with system timing. Furthermore, the use of an off-the-shelf
protocol chip thwarts attempts to inject a detectably corrupted message.
These goals can be accomplished with the use of background messages and
a single-bit fault flag within message data fields.
System failure scenarios can be set up and initiated using low-priority
background messages from a laptop computer. Since there is some
slack bandwidth available in a typical system, and CAN supports global prioritization
of messages, this approach leaves the system essentially unperturbed. A
failure scenario is a list of actions initiated upon request from an experimenter,
and takes effect when an initiating background message has the opportunity
to be transmitted.
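The reason low-priority background messages leave application traffic
essentially unperturbed can be seen in a simplified model of CAN arbitration,
in which the pending frame with the numerically lowest identifier wins the
bus. The sketch below assumes one frame per bus slot and uses a hypothetical
identifier (0x7FF, the lowest priority available with 11-bit identifiers) for
the background message; it is a toy model, not a description of the actual
experimental setup.

```python
import heapq

BACKGROUND_ID = 0x7FF   # hypothetical: lowest-priority 11-bit identifier


def arbitrate(ready_by_slot):
    """Return the identifier transmitted in each bus slot.

    ready_by_slot[i] lists the identifiers that become ready to send at
    slot i. In each slot the lowest pending identifier wins arbitration;
    None marks an idle slot. Frames still pending after the last slot are
    sent in priority order.
    """
    pending = []
    transmitted = []
    for ready in ready_by_slot:
        for identifier in ready:
            heapq.heappush(pending, identifier)
        transmitted.append(heapq.heappop(pending) if pending else None)
    while pending:
        transmitted.append(heapq.heappop(pending))
    return transmitted


application_traffic = [[0x100], [0x120], [], []]
with_background = [[0x100, BACKGROUND_ID], [0x120], [], []]

baseline = arbitrate(application_traffic)
perturbed = arbitrate(with_background)
```

In this example the background message waits until the first idle slot, so
the application frames occupy the same slots in both runs. On a real bus,
transmission is non-preemptive, so a background frame that has already started
can still delay a higher-priority frame by up to one frame time.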
Within a failure scenario, delays are readily introduced by individual
transmitters. More problematic is the generation of message failures. The
CAN controller does not permit injecting a detectable data error, because
it computes the CRC field in hardware. However, there are three approaches
that can be used in various circumstances:
- Coordinate failures by sending scripts to all nodes in the network
(e.g., all steering nodes should assume the next 10 messages
of type 37 have detectable failures).
- Use a spare bit within each message format to flag a detectable data
error (there is usually at least one spare bit available). This permits
using very simple receiver middleware software in small nodes
that may only have a few hundred bytes of program memory available.
- Send dummy messages that are thrown away by all receiving
nodes in lieu of faulty messages. In this way network traffic is not varied,
but all the work is performed by the transmitting node. This requires no
middleware at small receiver nodes, but does require clever
allocation of message priorities to permit interleaving dummy message priorities
with real message priorities.
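The first two approaches can be sketched in a few lines of receiver-side
logic. The code below is a minimal illustration under stated assumptions: the
spare bit is taken to be the most significant bit of the first data byte, the
payload is assumed to be non-empty, and all names are hypothetical rather than
drawn from the middleware described here.

```python
FAULT_FLAG_MASK = 0x80   # hypothetical: spare bit is the MSB of data byte 0


def mark_faulty(data: bytes) -> bytes:
    """Transmitter side: set the spare bit to flag a 'detectable' error."""
    return bytes([data[0] | FAULT_FLAG_MASK]) + data[1:]


def receive_with_flag(data: bytes):
    """Receiver side: discard flagged messages as if the CRC had failed."""
    if data[0] & FAULT_FLAG_MASK:
        return None
    return data


class ScriptedReceiver:
    """Receiver that follows a script sent in a background message."""

    def __init__(self):
        self.remaining = {}   # message type -> messages still to be failed

    def load_script(self, msg_type: int, count: int) -> None:
        # For example: treat the next 10 messages of type 37 as failed.
        self.remaining[msg_type] = count

    def receive(self, msg_type: int, data: bytes):
        left = self.remaining.get(msg_type, 0)
        if left > 0:
            self.remaining[msg_type] = left - 1
            return None       # simulated detectable failure
        return data


payload = b"\x12\x34"
flag_results = (receive_with_flag(payload),
                receive_with_flag(mark_faulty(payload)))

node = ScriptedReceiver()
node.load_script(msg_type=37, count=10)
script_results = [node.receive(37, payload) for _ in range(12)]
```

The flag check is a single mask-and-test, which is why it suits nodes with
only a few hundred bytes of program memory, whereas the scripted approach
needs per-type counters and a way to receive the script.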
The current state of the work is that initial middleware software is
operating under the QNX real-time operating system with CAN hardware. Simulation
and analytic modeling will be used to determine appropriate failure scenarios
to inject so as to be representative of failure rates experienced in a large
deployed fleet. Ultimately, faults will be injected on an operating
automated vehicle to characterize system robustness and recommend changes
to control algorithms, if necessary, to achieve required system-level safety
goals.
This work is sponsored by ONR contract N00014-96-1-0202, and by USDOT
under Cooperative Agreement Number DTFH61-94-X-00001 as part of the National
Automated Highway System Consortium (NAHSC).
References
[1] Villemeur, A., Reliability, Availability, Maintainability and Safety
Assessment, John Wiley & Sons, 1992 (pg. 533).
[2] U.S. Bureau of Transportation Statistics, Airline Traffic Statistics
spreadsheet, http://nasdac.faa.gov/bts/btsfrm41.xls, accessed April 10, 1998.
[3] U.S. Department of Transportation, NHTSA, Traffic Safety Facts 1996,
http://www.nhtsa.dot.gov/people/ncsa/Overvu96.pdf, accessed April 10, 1998.
[4] Stott, D., Ries, G., Hsueh, M. & Iyer, R., Dependability analysis of a
high-speed network using software-implemented fault injection and simulated
fault injection, IEEE Trans. Computers, 47 (1) 108-119, January 1998.
[5] Dawson, S., Jahanian, F., Mitton, T. & Tung, T., Testing of fault-tolerant
and real-time distributed systems via protocol fault injection, FTCS 96,
pp. 404-414.