Fault-Tolerance via Test and Reconfiguration
- Shawn Blanton*, Herman Schmit* and Seth Goldstein
- Electrical and Computer Engineering
- Computer Science Department
- Carnegie Mellon University, Pittsburgh, PA
Reconfigurable hardware will
dramatically reduce the cost of achieving fault tolerance, thereby enabling
a whole host of new reliable, low-cost applications. The reconfigurable
nature of the hardware compounds this benefit by enabling the implementation
of a wide variety of applications within a single architecture. Power, speed,
cost, and now fault tolerance can be optimized in a whole new way. In addition,
these attributes can be modulated dynamically during application execution
to ensure the highest level of performance in an operational environment
that may be ever-changing.
Reconfigurable hardware enables
a multidimensional design space for creating fault-tolerant applications.
Unlike traditional fault-tolerant systems, applications implemented in a
reconfigurable hardware fabric will be designed using traditional flows.
No special restrictions or enhancements will be needed to achieve reliability.
Instead, the reconfigurable controller and compiler will seamlessly impose
the circuit attributes required for gaining the desired level of fault tolerance.
Moreover, the level of fault tolerance does not have to remain static but
can be dynamically altered to adapt to environmental and/or user changes.
We envision the scenario illustrated in Fig. 1. Applications that require
high performance and/or low power and no fault tolerance are represented
by the single module of Fig. 1a. Here, the original design is configured by
the fabric controller and compiler for optimal performance. Applications
that require a mid-range level of fault tolerance can be implemented as
shown in Fig. 1b. This mode of operation (termed error-detect mode) requires
that the original module be duplicated (possibly virtually) and operated in
parallel with the original. A comparison module is also added to detect any
discrepancies between the two module outputs. A failure in error-detect mode
requires that the application halt and retry (after a potential
reconfiguration), resulting only in a graceful degradation of operational
performance. Finally, Fig. 1c illustrates the highest mode of fault-tolerant
operation. N-modular redundancy (NMR) can be employed for applications that
require extremely high levels of reliability. This configuration, at the
cost of a lower operating frequency and/or higher power consumption, ensures
that application operation can continue uninterrupted. Note that none of the
three operating modes requires any alteration of the original design; that
is, all configurations are handled automatically by the fabric controller
and compiler.
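The three operating modes of Fig. 1 can be sketched in software as follows.
This is only an illustrative model, not the fabric controller itself: the
Module type, the MismatchError exception, and the function names are
hypothetical, and each hardware module is assumed to behave like a pure
function of its input.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical stand-in for one configured copy of the application module.
Module = Callable[[int], int]


class MismatchError(Exception):
    """Raised when redundant copies disagree and no result can be trusted."""


def run_single(module: Module, x: int) -> int:
    """Fig. 1a: a single copy, tuned for performance, no fault tolerance."""
    return module(x)


def run_error_detect(copy_a: Module, copy_b: Module, x: int) -> int:
    """Fig. 1b: duplicate and compare; a mismatch halts so the caller can retry."""
    a, b = copy_a(x), copy_b(x)
    if a != b:
        raise MismatchError(f"copies disagree: {a} != {b}")
    return a


def run_nmr(copies: Sequence[Module], x: int) -> int:
    """Fig. 1c: N-modular redundancy with a strict-majority voter."""
    outputs = [module(x) for module in copies]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(copies) // 2:
        raise MismatchError("no majority among module outputs")
    return value
```

In this model the same module function is passed to all three runners
unchanged, which mirrors the claim that the original design needs no
alteration; only the wrapper around it differs.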
The operating modes of Fig. 1 exhibit a high degree of flexibility. For
example, consider the error-detect mode of operation of Fig. 1b. Unlike
traditional triple modular redundancy (TMR), error-detect mode leaves more
of the hardware fabric free to execute operations that are desirable but not
critical to mission operation. If the error-detect mode discovers a failure,
the hardware can then be quickly reconfigured into a TMR system until the
failure can be characterized and isolated. In TMR mode, more resources are
utilized to ensure reliable mission operation within a faulty fabric, at the
expense of not executing other less-critical operations. This solution
ensures that maximum resources are first dedicated to the optimal
performance of the application (speed, power, etc.) and not
towards protection against failures that do not yet exist.
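The escalation policy described above can be sketched as a small state
machine. The class and method names are hypothetical, the fabric is assumed
to be divided into equal-size units, and the fabric used by the comparison
module and voter is ignored for simplicity. Returning to error-detect mode
once the fault is isolated is our reading of "until the failure can be
characterized and isolated".

```python
from enum import Enum


class Mode(Enum):
    """Each mode's value is the number of module copies it keeps configured."""
    ERROR_DETECT = 2  # two copies plus a comparison module (Fig. 1b)
    TMR = 3           # three copies plus a voter (Fig. 1c)


class ModeController:
    """Hypothetical policy: run in error-detect mode until a mismatch occurs."""

    def __init__(self, fabric_units: int, module_units: int) -> None:
        self.fabric_units = fabric_units  # total units in the fabric (assumed)
        self.module_units = module_units  # units used by one module copy (assumed)
        self.mode = Mode.ERROR_DETECT

    def on_mismatch(self) -> None:
        # A detected failure escalates to TMR so the mission keeps running.
        self.mode = Mode.TMR

    def on_fault_isolated(self) -> None:
        # Once the fault is characterized and isolated, release the third copy.
        self.mode = Mode.ERROR_DETECT

    def free_units(self) -> int:
        # Fabric left over for desirable but non-critical operations.
        return self.fabric_units - self.mode.value * self.module_units
```

With a 100-unit fabric and a 20-unit module, for example, this model leaves
60 units free in error-detect mode and 40 in TMR mode, which is the resource
trade-off the paragraph describes.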
We also envision new and highly
cost-effective methods for achieving fault tolerance through frequently
executed built-in self-test (FEBIST). Reconfigurable hardware is inherently
regular in nature. Both the logic (LUTs, CLBs, etc.) and the interconnect
are highly uniform. This regular structure, combined with the reconfigurability
of the fabric, makes built-in self-test and diagnosis of the fabric
both quick and efficient. Some applications may have sufficient time slack
to allow the fabric's test capabilities to be exploited to achieve fault tolerance.
In this scheme, alternating portions of the fabric are frequently reconfigured
for a BIST session that (1) ensures the tested fabric is free of hard
failures and (2) in the presence of failures, identifies the smallest diagnosable
piece of the faulty fabric (CLB or interconnect segment). The diagnostic
results of the BIST session can then be used by the reconfiguration controller
to "configure around" the faulty fabric. The outcome of the proposed
FEBIST is also illustrated in Fig. 1. The grey-shaded portions of the various
blocks represent portions of the
faulty fabric that have been switched out of the block and replaced by fault-free
fabric. Note that both application modules and circuitry dedicated to fault
tolerance (voter, comparison, etc.) can have faulty fabric switched out.
This switching out of faulty fabric restores the reliability of the application
once a failure has been discovered, hence increasing the overall reliability.
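A minimal sketch of the "configure around" step follows. It assumes the BIST
session reports the set of faulty fabric units by integer ID and that the
controller holds a list of spare units; the function name, the
placement dictionary, and the unit numbering are all hypothetical.

```python
from typing import Dict, List, Set


def configure_around(placement: Dict[str, int],
                     faulty: Set[int],
                     spares: List[int]) -> Dict[str, int]:
    """Return a new placement with every block on a faulty unit moved to a spare.

    placement maps a block name (application module, voter, comparator, ...)
    to the fabric unit it currently occupies.
    """
    occupied = set(placement.values())
    # Only spares that are themselves fault-free and unoccupied are usable.
    free = [u for u in spares if u not in faulty and u not in occupied]
    new_placement: Dict[str, int] = {}
    for block, unit in placement.items():
        if unit in faulty:
            if not free:
                raise RuntimeError(f"no fault-free spare unit for {block!r}")
            unit = free.pop(0)  # switch the faulty unit out of the block
        new_placement[block] = unit
    return new_placement
```

Note that the voter and comparison logic are treated like any other block,
so circuitry dedicated to fault tolerance can be relocated in the same way
as the application modules.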
The frequency of the FEBIST
determines the level of fault tolerance achieved. If FEBIST sessions are
executed often, then a hard failure is less likely to go undiscovered
and cause a system error. Note that transient and intermittent
failures can still be dealt with in FEBIST using traditional data encoding
techniques like m-out-of-n codes. We therefore have a new dimension for
tuning and achieving various levels of fault tolerance in applications that
have enough slack to execute FEBIST. Less frequent FEBIST establishes lower
levels of fault tolerance but increases performance. Thus, the use of FEBIST
illustrates a new unexplored dimension for implementing fault-tolerant systems.
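As an illustration of the m-out-of-n codes mentioned above, the sketch below
checks whether an n-bit word is a valid codeword, that is, whether exactly m
of its n bits are set. The function name and the 2-out-of-5 example are
chosen here for illustration only.

```python
def is_valid_codeword(word: int, m: int, n: int) -> bool:
    """True if `word` fits in n bits and has exactly m bits set."""
    if not 0 <= word < (1 << n):
        return False
    return bin(word).count("1") == m


codeword = 0b01010               # a valid 2-out-of-5 word
corrupted = codeword ^ 0b00100   # one bit flipped, e.g. by a transient fault
```

Here is_valid_codeword(codeword, 2, 5) holds while
is_valid_codeword(corrupted, 2, 5) does not, because any single-bit flip
changes the number of set bits. An error that clears one bit and sets
another preserves the weight and would pass this check undetected.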
The envisioned scenario above
is only possible if reconfiguration can be done quickly and seamlessly.
Commercial FPGAs currently do not have this capability. The cached virtual
hardware (CVH) architecture developed here at CMU allows single-cycle configuration
for unit-size portions of the hardware fabric. This aspect of CVH along
with its virtualization of the applications makes the considered scenarios
possible.
Finally, it should be pointed out that we view the three modes of operation
identified in Fig. 1 as merely possible implementation points along a
performance continuum that
allows speed, fault-tolerance, and power to be traded off. Our vision is
to make such trade-offs along the whole continuum dynamically through the
CVH architecture and compiler in order to make applications that are truly
optimal given their current environmental conditions.
Advantages of Reconfigurable Fault Tolerance
Traditional fault-tolerant
systems incur a high cost due to the extra resources required to achieve
a reliable design. Less obvious are the costs associated with the additional
time and expertise necessary to design and validate the system's fault-tolerant
properties. Significant speed penalties can also result from fault tolerance.
In both hardware-based (such as NMR) and code-based (such as parity encoding)
techniques, input/output data must pass through checking circuitry to ensure
correctness. Finally, the increased power consumption can be a major drawback
in systems requiring high levels of reliability. For example, in TMR (NMR
with N=3), the amount of power required is more than tripled due to the
two additional units and voting logic required.
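The power arithmetic behind the TMR example can be written out as a simple
model. It assumes every module copy draws the same power and that the voter
adds a fixed amount on top; the function name and the wattage figures in the
usage note are hypothetical.

```python
def nmr_power(module_power_w: float, n: int, voter_power_w: float) -> float:
    """Total power of an N-modular-redundant system under the stated assumptions."""
    # n identical copies of the module, plus the voting logic.
    return n * module_power_w + voter_power_w
```

For instance, with a module assumed to draw 1.0 W and a voter assumed to
draw 0.1 W, nmr_power(1.0, 3, 0.1) exceeds 3 x 1.0 W, so the TMR system
draws more than three times the power of the unprotected module whenever the
voter consumes any power at all.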
Reconfigurable hardware either
alleviates or altogether eliminates the traditional costs associated with
achieving fault tolerance.
There are many advantages
to using reconfigurable hardware to achieve fault tolerance, compared to
traditional techniques, beyond what is described above.
- Probably the most significant advantage
of using reconfigurable hardware to achieve fault tolerance is the ability
to fine-tune the level of reliability desired. Because the hardware is
truly reconfigurable, the degree of fault tolerance can be chosen to meet
some predefined specification or altered dynamically to meet changing environmental
conditions. For situations in which low or no fault tolerance is desired,
the hardware can be configured to provide the best performance in terms
of speed and/or power.
- The extra cost of designing a fault-tolerant
system is now lower. Because reconfigurability allows the amount of fault
tolerance to be tuned, a single design enables a wider variety of applications.
This implies that the extra development cost can be amortized over a larger
customer base.
- The use of reconfigurable hardware makes
for a truly active technique for achieving fault tolerance. The discovery,
isolation, and recovery of hard faults with reconfigurable hardware can
now be done at a level which maximizes the fault-free resources utilized
in the reconfigured system. In traditional fault-tolerant applications,
the granularity of fault isolation must be decided a priori. With reconfigurable
hardware, the granularity can be at the interconnect and sub-CLB level
and can be customized to the diagnostic methodology employed.