FTSM: A Fault-Tolerant Spaceborne
Microcontroller
- D. W. Caldwell, D. A. Rennels
- Department of Computer Science, 4731 Boelter Hall
- University of California, Los Angeles, CA 90024
The search within the aerospace industry to dramatically reduce the cost
of high-performance spacecraft makes the use of non-hardened, commercial
microcontrollers extremely attractive. These very inexpensive devices are
highly integrated computer systems on a chip: a processor and various support
functions such as program memory, scratchpad RAM, discrete I/O, A/D converters,
serial ports, and counter/timers. Commercial microcontrollers have not been
widely used in space because of their low radiation tolerance -- resulting
in a very high rate of transient errors and a potential for circuit latchup
[1]. As it is not cost-effective to modify these devices, fault-tolerance
is the only option available. FTSM is an architecture, currently implemented
in breadboard form and being tested, that is aimed at applying fault-tolerance
to deal with these problems and develop systems suitable for use in space.
Microcontrollers have little or no built-in fault tolerance features
and, since they are very highly integrated, it becomes an interesting design
challenge to protect the features already there (e.g., serial buses, programmable
bi-directional I/O pins) and to make use of them in implementing fault-tolerance.
A key constraint is to minimize the amount of external support logic so
that the main advantage of microcontrollers', high functional density, can
be maintained.
The block diagram of a spacecraft subsystem (e.g., an IMU) containing
an FTSM is shown in Figure 1. Redundant microcontrollers operate in a voted
configuration, and they are time-triggered by a common real time interrupt.
At the top of the figure, the subsystem interface is shown as just a source
of power and communications. The I/O of multiple processors are combined
and protected by I/O isolation and connected with the sensors and actuators.
An External Conflict Resolver circuit may reset or power-cycle the individual
microcontrollers.
Figure 1: A Subsystem with an Embedded FTSM
During normal operation, one microcontroller is the Master of the system,
while the others provide redundant computation and voting opinions as Checkers.
The Master and Checkers are loosely synchronized and execute identical application
programs, periodically calling support functions which implement the fault-tolerance
features. If a microcontroller disagrees with its peers, it can be commanded
offline and brought back as a Checker if it can be successfully restarted.
Devices are not statically assigned so the operating mode of each device
is fluid [2].
Most of the I/O of the multiple processors is bussed together - corresponding pins
are connected to a common circuit node through isolation elements (resistors);
only the signals governing the external conflict resolution are unique to
each processor. This approach simplifies interconnection and makes the architecture
easily extensible. Some of the microcontroller's I/O pins are consumed implementing
the Check I/O necessary for fault-tolerance; the remainder are available
to the application as Normal I/O. The Check I/O pins provide three
functions supporting fault-tolerance: 1) a Master Channel (2-pin
serial bus) for data communications between processors, 2) an Operating
Mode Channel (6-pins) to allow each processor to broadcast to the others
that it is in one of three operating modes, and 3) four External Resolver
control signals (4-pins) from each micro-controller to request recovery
actions. All the check I/O signals use existing I/O functions on the microcontroller
and only require external resistive isolation circuits be added.
Each microcontroller, executing identical code, expects voting events
and their associated messages to appear on the Master Channel. If it detects
an error, it signals a special External Conflict Resolver requesting that
an action be applied to itself, its left neighbor, its right neighbor, or
all the modules. The action request is to either reset or power-cycle the
specified processor(s), and the External Conflict Resolver carries out an
action if it is requested by two or more microcontrollers. In order to reduce
its susceptibility to transient errors, it is made up of combinational logic.
Circuit isolation is essential in this design: (i) to prevent catastrophic
shorts, and (ii) to make it possible to remove power from a module for latchup
circumvention. Isolation is provided with resistors in series with the bussed
I/O pins of the microcontroller as shown in Figure 2. Pull-down resistors
are also employed so that the common output becomes the logical OR of the
individual signals.
Power must quickly be turned off when a latchup is detected, but external
logic signals can leak past the input protection diodes and still parasitically
power the chip enough to fail to clear the latchup condition. In this design
when power is removed, Vcc is shorted to ground, and the series resistors
divide logic voltages to an acceptably low level.
Figure 2: Circuit Isolation
In addition to the conventional ways of protecting outputs (voting outputs
at the actuators or strobing data into radiation-resistent latches in the
actuators immediately after a successful vote), a novel low-cost masking
approach has been developed to protect outputs against single-event upsets.
The isolation circuits between microcontrollers produce a logical OR of
their individual output signals. If any microcontroller produces a "one"
on a bussed output line, the result will be a logic "one". Due
to the bi-directional nature of microcontroller pins, two flip flops must
be set for an output pin to generate a one-- an output 3-state enable and
the data latch. Thus we always reset both flip flops for output signals
in each microcontroller when a zero is desired. It takes multiple bit-flips
in output data and control registers to generate an erroneous output.
An experimental testbed has been constructed based on the 8-bit Microchip
PIC 16C7x micro-controller. A small circuit card was constructed with triplicated
microcontrollers, a PLD-based external conflict resolver and the specially
isolated interconnections between processors. To simplify initial testing,
three in-circuit emulators are used in the circuit board, and the emulators
and I/O circuits are connected to a PC that controls and runs fault-insertion
experiments on the testbed. The microcontrollers can be interrupted and
controlled through their built-in asynchronous serial ports.
A transient fault (simulating a single-event upset) can be inserted in
any processor by flipping any bit in any memory location or register and
observing the response of the system to the inserted error. Initial test
results are very promising. Recovery has occurred in over 99.9% of tests.
In the majority of cases, one processor is voted off-line and computation
continues while it is restarted. In approximately 1/3 of the cases the Master
is taken off-line requiring a rollback/restart of the system.
We plan to interface the microcontrollers to an inertial measurement
unit consisting of three surplus gyros and accelerometers to provide a realistic
application environment in which the hardware and software of the FTSM resides.
A future step in this development will be to expose the microcontrollers
to a real radiation environment either on the ground or in space to see
how well the fault-tolerance techniques work in their planned environment.
** This work is sponsored at UCLA by the Office of
Naval Research under grant #N00014-96-1-0837
References
1. G. C. Messenger, M. S. Ash. "The Effects of Radiation on Electronic
Systems." Second Edition. Van Nostrand Reinhold, New York, 1992.
2. D. W. Caldwell, D. A. Rennels. "A Minimimalist Hardware Architecture
for Using Commercial Microcontrollers in Space." 16th Digital Avionics
Systems Conference, Irvine, CA. 28-30 Oct 1997.
_________________________
1. Author Contact 4731 Boelter Hall, UCLA Los, Angeles 90024, tel. 818
790 2195, FAX 310 825 2273, email rennels@cs.ucla.edu.
|