Fault-Tolerance via Test and Reconfiguration

Shawn Blanton*, Herman Schmit* and Seth Goldstein
Electrical and Computer Engineering
Computer Science Department
Carnegie Mellon University, Pittsburgh, PA
 

Reconfigurable hardware will dramatically reduce the cost of achieving fault tolerance, thereby enabling a whole host of new reliable, low-cost applications. The reconfigurable nature of the hardware compounds this benefit by enabling the implementation of a wide variety of applications within a single architecture. Power, speed, cost, and now fault-tolerance can be optimized in a whole new way. In addition, these attributes can be modulated dynamically during application execution to ensure the highest level of performance in an operational environment that may be ever-changing.

Reconfigurable hardware enables a multidimensional design space for creating fault-tolerant applications. Unlike traditional fault-tolerant systems, applications implemented in a reconfigurable hardware fabric will be designed using traditional flows. No special restrictions or enhancements will be needed to achieve reliability. Instead, the reconfiguration controller and compiler will seamlessly impose the circuit attributes required to reach the desired level of fault tolerance. Moreover, the level of fault tolerance does not have to remain static but can be dynamically altered to adapt to environmental and/or user changes.

We envision the scenario illustrated in Fig. 1. Applications that require high performance and/or low power and no fault tolerance are represented by the single module of Fig. 1a. Here, the original design is configured by the fabric controller and compiler for optimal performance. Applications that require a mid-range level of fault tolerance can be implemented as shown in Fig. 1b. This mode of operation (termed error-detect mode) requires that the original module be duplicated (possibly virtually) and operated in parallel with the original. A comparison module is also added to detect any discrepancies between the two module outputs. A failure in error-detect mode requires that the application halt and retry (after a potential reconfiguration), resulting only in a graceful degradation of operational performance. Finally, Fig. 1c illustrates the highest mode of fault-tolerant operation. N-modular redundancy (NMR) can be employed for applications that require extremely high levels of reliability. This configuration, at the cost of a slower operating frequency and/or higher power consumption, ensures that application operation can continue uninterrupted. Note that all three operating modes require no alteration of the original design; that is, all configurations are handled automatically by the fabric controller and compiler.
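The three operating modes can be sketched behaviorally in software. The following is only an illustration under stated assumptions: a module is modeled as a plain function from an input word to an output word, and the names `run_single`, `run_error_detect`, `run_nmr`, and `MismatchError` are hypothetical, not part of any fabric controller or compiler described here.

```python
from collections import Counter
from typing import Callable, Sequence

# Assumption: a module is modeled as a function from an input word to an output word.
Module = Callable[[int], int]


class MismatchError(Exception):
    """Raised when redundant copies disagree and the result cannot be trusted."""


def run_single(module: Module, x: int) -> int:
    # Fig. 1a: one copy, no checking.
    return module(x)


def run_error_detect(copy_a: Module, copy_b: Module, x: int) -> int:
    # Fig. 1b: duplicate and compare; a discrepancy halts so the caller can retry.
    out_a, out_b = copy_a(x), copy_b(x)
    if out_a != out_b:
        raise MismatchError(f"copies disagree: {out_a} != {out_b}")
    return out_a


def run_nmr(copies: Sequence[Module], x: int) -> int:
    # Fig. 1c: N copies feed a majority voter; a faulty minority is masked.
    outputs = [module(x) for module in copies]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise MismatchError("no strict majority among module outputs")
    return value
```

In this sketch the same unmodified `module` function is used in every mode; only the wrapper around it changes, which mirrors the claim that the original design needs no alteration.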

The operating modes of Fig. 1 exhibit a high degree of flexibility. For example, consider the error-detect mode of operation of Fig. 1b. Unlike traditional triple modular redundancy (TMR), more of the hardware fabric is now free to execute operations that are desirable but not critical to mission operation. If the error-detect mode discovers a failure, the hardware can then be quickly reconfigured into a TMR system until the failure can be characterized and isolated. In TMR mode, more resources are utilized to ensure reliable mission operation within a faulty fabric, at the expense of not executing other less-critical operations. This solution ensures that maximum resources are first dedicated to the optimal performance of the application (speed, power, etc.) and not to protection against failures that do not yet exist.
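The escalation policy described above can be sketched as a small state machine. This is a hypothetical model, not the actual controller: the class `ModeController`, its method names, and the resource accounting (fabric measured in abstract units, with comparator and voter overhead ignored) are all assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    PERFORMANCE = 1   # Fig. 1a: one copy of the module
    ERROR_DETECT = 2  # Fig. 1b: two copies plus a comparator
    TMR = 3           # Fig. 1c with N = 3: three copies plus a voter


class ModeController:
    """Hypothetical policy: run in error-detect mode, escalate to TMR when a
    mismatch is seen, and return once the fault has been isolated."""

    def __init__(self, total_units: int, module_units: int) -> None:
        self.total_units = total_units    # size of the fabric, in abstract units
        self.module_units = module_units  # units needed by one copy of the module
        self.mode = Mode.ERROR_DETECT

    def spare_units(self) -> int:
        # Units left over for non-critical work (checker overhead ignored).
        return self.total_units - self.mode.value * self.module_units

    def on_mismatch(self) -> None:
        # A discrepancy was detected: trade spare capacity for fault masking.
        self.mode = Mode.TMR

    def on_fault_isolated(self) -> None:
        # The fault has been characterized and switched out: release the third copy.
        self.mode = Mode.ERROR_DETECT
```

With a fabric of 100 units and a 30-unit module, for example, this model leaves 40 units for non-critical work in error-detect mode and 10 units in TMR mode.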

We also envision new and highly cost-effective methods for achieving fault tolerance through frequently executed built-in self-test (FEBIST). Reconfigurable hardware is inherently regular in nature. Both the logic (LUTs, CLBs, etc.) and the interconnect are highly uniform. This regular structure, combined with the reconfigurability of the fabric, makes built-in self-test and diagnosis of the fabric both quick and efficient. Some applications may have sufficient time slack to allow the fabric's test capabilities to be exploited to achieve fault tolerance. In this scheme, alternating portions of the fabric are frequently reconfigured for a BIST session which (1) ensures that the tested fabric is free of hard failures and (2) in the presence of failures, identifies the smallest diagnosable piece of the faulty fabric (CLB or interconnect segment). The diagnostic results of the BIST session can then be used by the reconfiguration controller to "configure around" the faulty fabric. The outcome of the proposed FEBIST is also illustrated in Fig. 1. The grey shaded portions of the various blocks represent portions of the faulty fabric that have been switched out of the block and replaced by fault-free fabric. Note that both application modules and circuitry dedicated to fault tolerance (voter, comparison, etc.) can have faulty fabric switched out. Switching out faulty fabric restores the reliability of the application once a failure has been discovered, hence increasing the overall reliability.
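A minimal software sketch of one FEBIST round follows. It rests on several assumptions: fabric units (CLBs or interconnect segments) are identified by integers, the per-unit test is supplied as a callback, and the application is not relocated while its region is under test. The function names `febist_session`, `configure_around`, and `febist_round` are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple


def febist_session(region: List[int], unit_passes: Callable[[int], bool]) -> Set[int]:
    """Test every unit in one region and return the units that fail."""
    return {unit for unit in region if not unit_passes(unit)}


def configure_around(
    placement: Dict[str, int], faulty: Set[int], spares: List[int]
) -> Dict[str, int]:
    """Remap any block placed on a faulty unit to a fault-free spare unit."""
    in_use = set(placement.values())
    free = [u for u in spares if u not in faulty and u not in in_use]
    remapped = dict(placement)
    for block, unit in placement.items():
        if unit in faulty:
            if not free:
                raise RuntimeError("no fault-free spare unit available")
            remapped[block] = free.pop(0)
    return remapped


def febist_round(
    regions: List[List[int]],
    unit_passes: Callable[[int], bool],
    placement: Dict[str, int],
    spares: List[int],
) -> Tuple[Dict[str, int], Set[int]]:
    """Test the regions one after another, then configure around all faults found."""
    faulty: Set[int] = set()
    for region in regions:  # only one portion of the fabric is under test at a time
        faulty |= febist_session(region, unit_passes)
    return configure_around(placement, faulty, spares), faulty
```

Note that `placement` may hold fault-tolerance circuitry (a voter or comparator) as well as application blocks; the sketch treats them identically, as the text describes.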

The frequency of FEBIST determines the level of fault tolerance achieved. If FEBIST sessions are executed often, then a hard failure is less likely to go undiscovered and cause a system error. Notice that transient and intermittent failures can still be dealt with in FEBIST using traditional data encoding techniques such as m-out-of-n codes. We therefore have a new dimension for tuning and achieving various levels of fault tolerance in applications that have enough slack to execute FEBIST. Less frequent FEBIST establishes lower levels of fault tolerance but increases performance. Thus, the use of FEBIST illustrates a new, unexplored dimension for implementing fault-tolerant systems.
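The trade-off between test frequency and performance can be made concrete with a simple model. The model is an assumption introduced here for illustration, not a result from this work: hard failures are taken to arrive at a constant rate (an exponential failure model), each BIST session is assumed to detect every hard failure in the tested portion, and the function names are hypothetical.

```python
import math


def p_failure_between_tests(failure_rate: float, test_interval: float) -> float:
    """Probability that at least one hard failure arises within one test interval,
    assuming a constant failure rate (exponential model)."""
    return 1.0 - math.exp(-failure_rate * test_interval)


def bist_overhead(test_duration: float, test_interval: float) -> float:
    """Fraction of time a fabric portion spends under test instead of doing useful work."""
    return test_duration / test_interval
```

Under these assumptions, shortening the test interval lowers the chance that a hard failure is present and undiscovered when results are produced, while raising the fraction of time lost to testing. For small values of the product of failure rate and interval, halving the interval roughly halves the exposure and doubles the overhead.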

The envisioned scenario above is only possible if reconfiguration can be done quickly and seamlessly. Commercial FPGAs currently do not have this capability. The cached virtual hardware (CVH) architecture developed here at CMU allows single-cycle configuration for unit-size portions of the hardware fabric. This aspect of CVH along with its virtualization of the applications makes the considered scenarios possible.

Finally, it should be pointed out that we view the three modes of operation identified in Fig. 1 as only possible implementation points along a performance continuum that allows speed, fault-tolerance, and power to be traded off. Our vision is to make such trade-offs along the whole continuum dynamically through the CVH architecture and compiler in order to make applications that are truly optimal given their current environmental conditions.

Advantages of Reconfigurable Fault Tolerance

Traditional fault-tolerant systems incur a huge cost due to the extra resources required to achieve a reliable design. Less obvious are the costs associated with the additional time and expertise necessary to design and validate the system's fault-tolerant properties. Significant speed penalties can also result from fault tolerance. In both hardware-based (such as NMR) and code-based (such as parity encoding) techniques, input/output data must pass through checking circuitry to ensure correctness. Finally, the increased power consumption can be a major drawback in systems requiring high levels of reliability. For example, in TMR (NMR with N=3), the amount of power required is more than tripled due to the two additional units and the voting logic.

Reconfigurable hardware either alleviates or altogether eliminates the traditional costs associated with achieving fault tolerance.

Beyond those described above, there are many advantages to using reconfigurable hardware to achieve fault tolerance compared to traditional techniques.

  • Probably the most significant advantage of using reconfigurable hardware to achieve fault tolerance is the ability to fine-tune the level of reliability desired. Because the hardware is truly reconfigurable, the degree of fault tolerance can be chosen to meet some predefined specification or altered dynamically to meet changing environmental conditions. In situations where little or no fault tolerance is desired, the hardware can be configured to provide the best performance in terms of speed and/or power.
  • The extra cost of designing a fault-tolerant system is now lower. Because reconfigurability allows the amount of fault tolerance to be tuned, a single design enables a wider variety of applications. This implies that the extra development cost can be amortized over a larger customer base.
  • The use of reconfigurable hardware makes for a truly active technique for achieving fault tolerance. The discovery, isolation, and recovery of hard faults with reconfigurable hardware can now be done at a level which maximizes the fault-free resources utilized in the reconfigured system. In traditional fault-tolerant applications, the granularity of fault isolation must be decided a priori. With reconfigurable hardware, the granularity can be at the interconnect and sub-CLB level and can be customized to the diagnostic methodology employed.