Software-Implemented Fault Tolerance for Supercomputing in Space

John A. Rohr
Jet Propulsion Laboratory
Pasadena, California, USA
John.A.Rohr@Jpl.Nasa.Gov

The NASA Jet Propulsion Laboratory Remote Exploration and Experimentation (REE) Project is a large multi-year technology demonstration project which will develop low-power, scalable, fault-tolerant, high-performance computing for use in space and will demonstrate that significant onboard processing capability enables a new class of science missions. This will permit increased data collection rates, mitigate downlink limitations, and reduce ground station operations, enabling greater science return at lower cost.

The REE Project plans to attain these goals by adapting earth-based supercomputing technology developed by industry for routine use in outer space, while dramatically reducing the mass, size, and power consumption of these systems. REE will adapt commercially-developed, ultra-low-power components for use in fault-tolerant architectures which can be scaled with available power to provide high performance and which can handle the high rate of radiation-induced transient errors that are expected in space. A testbed currently under development will be used to experiment with fault-tolerance implementations and demonstrate space science applications. After the experiments have been completed and the results analyzed, a flight prototype will be built.

A primary goal of the REE Project is to validate computing efficiencies on the order of hundreds of MIPS per watt in a multiprocessor architecture that can scale from 1 to 100 watts, depending on the specific application and mission requirements. At the low end of this spectrum, REE will develop spacecraft data systems, including mass storage, that are capable of operating on less than one watt of electrical power for extremely power-constrained missions. At the high end, REE plans to develop spacecraft data systems which will scale up to multiple processors to provide thousands of MIPS of performance for missions that are not severely power-constrained. These designs must be capable of reliable operation in space for 10 years or more using commercially-available components.

The REE Project intends to leverage commercial computing technology to the greatest extent possible to maximize performance and minimize power and cost while taking advantage of available software, support tools, and standards. The use of components based on commercial designs creates a significant problem with transient errors (called single-event upsets or SEUs in space terminology) which can occur frequently in the space environment. Thus fault-tolerant designs will be required to provide satisfactory operation, and the fault-tolerance design approaches will face two fundamental constraints:

1. Since the REE Project has chosen to use components that are functionally identical to commercial components, no changes can be made to their designs. While this will enable use of all software developed for the commercially-available components, it also limits the fault-tolerance mechanisms which can be incorporated to two areas: the interface and interconnection circuits surrounding the commercial components, and software written in the environments provided.

2. The requirement for extremely low power precludes the use of massive redundancy techniques for all but the most critical functions and data.

Thus fault-tolerance mechanisms that are used within the constraints of the REE Project will be required to handle errors which result from single-event upsets and permanent failures of system components, and they must facilitate system recovery and resumption of normal operation after a fault occurs.

It is expected that the usual fault-tolerance features which are incorporated in commercial technology will be available. For example, the communications systems can include multiple paths, error-detecting/correcting codes, retries, automated routing, and other capabilities. Memories can utilize multiple copies, error-correcting codes, and address protection registers. Overall control can include timers, heartbeats, and redundancy. Finally, any fault-tolerance features which are available in the processors which are used may be utilized to further increase the fault tolerance of the entire system. Additional hardware features which can be incorporated into the systems will be limited to only interconnection circuits that support fault tolerance since the processors themselves cannot be altered.
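As an illustration of the heartbeat style of overall control mentioned above, the following is a minimal sketch rather than any part of the REE design. The class name `HeartbeatMonitor`, the node names, and the timeout value are all hypothetical; time is passed in explicitly so the example is deterministic.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time of each node (hypothetical sketch)."""

    def __init__(self, timeout):
        # Maximum allowed silence, in the same units as the `now` arguments.
        self.timeout = timeout
        self._last_seen = {}

    def beat(self, node, now):
        """Record that `node` reported in at time `now`."""
        self._last_seen[node] = now

    def suspects(self, now):
        """Return the nodes that have been silent for longer than the timeout."""
        return sorted(
            node
            for node, seen in self._last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=2.0)
monitor.beat("node-a", now=0.0)
monitor.beat("node-b", now=0.0)
monitor.beat("node-a", now=2.5)      # node-b stays silent
late = monitor.suspects(now=3.0)     # only node-b has exceeded the timeout
```

A controller could treat the suspected nodes as candidates for diagnosis or restart; what action to take is outside the scope of this sketch.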

Although architectural mechanisms can be expected to contribute to the fault tolerance of computer systems developed by the REE Project, the limited flexibility in hardware design and implementation will require that much of the fault-tolerance capability be provided by software-implemented fault-tolerance techniques. Because of the desire to maximize the use of commercial software already developed for the processors used in REE computer systems and because of the complexity of modifying such software, many of the software fault-tolerance capabilities will most likely be provided by middleware, a layer of software inserted between the operating system and the applications programs. Middleware will be used to provide checkpointing and restart, multiple levels of fault tolerance for different applications running concurrently, monitoring of transient faults, and other capabilities such as consensus determination of redundant processes and results.
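The combination of checkpointing, restart, and consensus determination described above can be sketched in a few lines. This is an illustrative sketch only, not the REE middleware: the names `majority_vote`, `Checkpointer`, and `run_step_with_recovery` are hypothetical, the replicas run sequentially in one process, and results are assumed to be hashable and never `None`.

```python
import copy
from collections import Counter


def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


class Checkpointer:
    """Keeps an in-memory copy of the last known-good state."""

    def __init__(self, state):
        self._saved = copy.deepcopy(state)

    def save(self, state):
        self._saved = copy.deepcopy(state)

    def restore(self):
        return copy.deepcopy(self._saved)


def run_step_with_recovery(step, state, checkpointer, replicas=3, max_retries=2):
    """Run `step` on several copies of the state and vote on the results.

    If no strict majority agrees, roll back to the last checkpoint and retry.
    """
    for _ in range(max_retries + 1):
        results = [step(copy.deepcopy(state)) for _ in range(replicas)]
        agreed = majority_vote(results)
        if agreed is not None:
            checkpointer.save(agreed)   # the agreed result becomes the new checkpoint
            return agreed
        state = checkpointer.restore()  # restart from the last good state
    raise RuntimeError("no majority agreement after retries")


checkpoint = Checkpointer(0)
new_state = run_step_with_recovery(lambda s: s + 1, 0, checkpoint)
```

Running three replicas triples the computation, so in a power-constrained system this level of redundancy would presumably be reserved for the most critical functions, with cheaper checks used elsewhere.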

The use of middleware for software fault tolerance will not provide all the software capabilities needed. Faults that occur during execution of the operating system will largely be beyond the reach of middleware. Thus fault-tolerance mechanisms will be needed in the operating system itself. This area may be the most difficult to handle, however, since no major work on operating systems is planned as a part of the project. Applicable independent work done in this area will be incorporated into the REE Project.

Much of the fault-tolerance capability for the REE Project is expected to be provided by the applications programs themselves. Often the programmer is best placed to specify optimal placement for the checkpoints and the algorithms for acceptance checks and calibrated computations that are needed for fault detection and recovery. Although automated insertion of such capabilities would be desirable, it is beyond the intended scope of the REE Project to develop such techniques. However, many existing software fault-tolerance mechanisms and techniques are available which can be used by the REE applications programs. An initial check which can be done by all programs is to perform range checks and reasonableness tests. Data which is clearly out of range will indicate the possible presence of errors. Also, programs and constant data can be checkpointed and periodically checked to ensure that they have not changed. Beyond these simple checks, the computational results of many algorithms can be checked by applying inverse algorithms or other data manipulations. For example, the inverse of a matrix can be multiplied by the original matrix. If the result is the identity matrix, both the original calculations and the check can be presumed to be correct. Another example is the use of assertions which can be dynamically checked for validity within the program.
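The application-level checks described above (range checks, verification of constant data, and the matrix-inverse check) might look like the following minimal sketch. The function and class names are hypothetical. Two assumptions go beyond the text: constant data is guarded with a CRC-32 checksum, and the identity comparison uses a small tolerance `tol`, since floating-point products are rarely exactly the identity.

```python
import zlib


def in_range(value, low, high):
    """Range check: True only if low <= value <= high (NaN fails the check)."""
    return low <= value <= high


class ConstantGuard:
    """Remembers a CRC-32 of constant data so later corruption can be detected."""

    def __init__(self, data: bytes):
        self._crc = zlib.crc32(data)

    def unchanged(self, data: bytes) -> bool:
        return zlib.crc32(data) == self._crc


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def inverse_check(matrix, inverse, tol=1e-9):
    """Acceptance check: matrix * inverse must equal the identity within tol."""
    product = matmul(matrix, inverse)
    n = len(matrix)
    return all(
        abs(product[i][j] - (1.0 if i == j else 0.0)) <= tol
        for i in range(n)
        for j in range(n)
    )


m = [[4.0, 7.0], [2.0, 6.0]]
good_inverse = [[0.6, -0.7], [-0.2, 0.4]]
bad_inverse = [[0.6, -0.7], [-0.2, 0.5]]   # one element corrupted

# Dynamically checked assertions, in the sense used in the text.
assert inverse_check(m, good_inverse)
assert not inverse_check(m, bad_inverse)
```

A failed check only signals that either the computation or the check itself was disturbed; the usual response would be to recompute from a checkpoint.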

It is well known that software-implemented error detection, coupled with the constraint of not using massive redundancy due to severe power limitations, will limit the overall fault-tolerance coverage of the REE system. Thus the challenge lies in providing satisfactory performance within these constraints.

Since the primary use of the high-performance REE system is science data processing rather than spacecraft control, the primary objective is high availability rather than a guarantee that no errors will be made in computations. Thus the system will contain a small, heavily-protected hard core that can periodically reload, diagnose, and restart the system to handle those errors and faults that may have been missed by the software-implemented fault-tolerance mechanisms.

The development of software-implemented fault tolerance is an integral part of the REE Project that will involve the JPL REE team, its contractors, the science application developers, and university researchers. Each has a part to contribute to ensuring the success of this important technology demonstration project.