Dependable Circuits Based on the XMR Architecture
Jose Miguel Vieira Santos 1, Jose Manuel Martins
Ferreira 2
- FEUP/ISEP- R.S. Tomé, 4200 Porto, Portugal; jmvs@dee.isep.ipp.pt
(contact)
- FEUP- R. Bragas, 4050 Porto, Portugal; jmf@fe.up.pt
Abstract
The new XMR architecture is a TMR-like design aiming single IC integration
and easy of design. An enhanced BST infrastructure replaces one replica
and the Voter, allowing error confinement and a timely recovery.
Keywords: Fault-Tolerance, F-T Architectures, BST, Concurrent
Test, Fault-Location.
Introduction
Many Fault-Tolerant (F-T) systems in automotive, railway, industrial
control and even aerospace, are real-time designs requiring error
confinement but, usually, may allow a small latency interval
to resume operation, if the outputs are set to a known-safe
state meanwhile [1]. This opens a wide field
of application to dependable Integrated Circuits (IC) providing these
features.
Most safety-critical hardware designs are self-checking (S-C)
or replication based architectures [2,3], but the integration of these architectures in VLSI
ICs (ASICs or PLDs) has advantages and drawbacks: the many gates available
are not accompanied by pin number impacting accessibility, and redundancy
is desirable to a certain level only [4]. Replication, being interesting in a VLSI (a TMR
for example), is however hampered by common mode faults (cmf);
S-C designs facing permanent faults must be duplex [5]. Duplication of the mission circuit is
a good compromise in VLSI ICs but has not enough information to resume a
correct operation and may only provide a fail-stop solution [6]. When a timely recovery
is acceptable, the latency delay, of many milliseconds or even seconds
[7], allows another
solution to be envisaged, with the help of the Boundary Scan (BS)
Test (BST) 1149.1 infrastructure [8].
The new architecture presented, XMR, may be seen as a sub-group
of N-Modular Redundancy (NMR) designs: a duplication design
in which an enhanced BST infrastructure (fully 1149.1 compliant) provides
decisions when the replicas disagree. Working alone, a XMR IC may only
confine errors, but a timely recovery is available when supported by a BST-controller,
a system-level BST interface needed in many designs [9].
XMR is presented for combinatory designs first; the extension to sequential
logic is considered in the end.
Background and Objectives
To start with we must clarify some terms here:
- CUT: the circuit under test, or mission circuit.
- Module: a 2-CUT IC with BST, expected to behave as a
Fault Containment Region [1].
- Fault model: single-faults only considered at first. Replication
based designs are vulnerable to cmf, but a XMR IC is able to detect
most permanent cmf.
Error confinement means an immediate detection avoiding
the error to spread out the module. To resume a correct operation
means to feed the outputs with the right CUT (always running) as soon as
the fault is located. A latency interval shorter than the systems
time granularity allows to correct the error, otherwise we talk
about correct recovery only.
Our objective is to design dependable ICs easily, with low overhead and
improved reliability. According to the theory of information applied to
digital testing [10],
the probability of detecting faults is directly related to the
quantity of information; then, the fault location interval is
inversely dependent on this quantity, related here to the redundancy mainly.
Since a 2-CUT module may recover when the fault is located, the XMR base
idea is to accept a delayed fault location in change with hardware overhead.
Considering features as VLSI IC limitations, fault model and latency,
concurrent test and design diversity, the objectives for a XMR IC where
defined:
- error confinement: provides a known output,
- fault location: allows to resume operation,
- duplication based IC with no coding circuitry,
- reuse of BST to compare outputs and vote,
- partial Design Diversity to detect cmf.
The XMR architecture relies on POST [ 11], a self-synchronized BST infrastructure enhanced
to reduce on-line scan requirements. The BST-controller (BSm
C, 1-4KB ROM, 8 mandatory pins) provides additional information on-line,
scanning the input and output test patterns (VTi+VTo)
previously defined for functional verification and stored in the
internal ROM; when an input match occurs the output may perform several
functions. POST was now optimized to verify a 2-CUT architecture concurrently,
enhanced to provide decisions, and adds a partial design diversity
to face cmf arising from replication inside a single IC.
The XMR Architecture
A XMR module has 2 CUTs, a set of input BS cells to detect input matches
(D), the set of enhanced output BS cells to provide decisions,
and the 1149.1 std TAP controller. CUT comparison confines
the error in the out cells, driving the output (Z) to a known state.
Independently, the BSm C scans the input and
output test vectors (VTi+VTo) continuously [11]; upon occurrence of an input match, the output cells
have information to vote and disable the faulty CUT, if any.
Let Q and QM be the quantity of information
in each CUT and in the module respectively, and rÎ[0,1[ the fraction of information accessible through
BST (the patterns stored in the BSm C); a XMR
module deals with QM=Q*(1+1+r),
allowing a TMR-like behavior with discrete voting and 4 modes
of operation:
- stand-alone (r =0): error
confinement, known output.
- normal (QM=Q*(1+1+r
)): error confinement, known output, recovery.
- single-CUT (QM=Q*(1+r
)): supervised in POST mode to provide a timely error detection.
- survival (QM=Q*r ):
VTo is injected if Vi=VTi [ 11].
To speed-up error treatment decisions must occur at BS cells level, requiring
to redesign one multiplexer in the output cells, for a double function:
- comparison of replicas (A, B).
- voting among replicas (A, B) and the content of a cell latch
(C), upon occurrence of an input match (D).
A disagreement activates the error signal (E) to force the cell
outputs (Zi) into one of 4 states:
- 0, 1, high-impedance: are design dependent to provide a known
output, and allow any output combination.
- to freeze the current output: cells need an extra latch.
The following input match disables the faulty CUT, and this interval
may be estimated as function of the inputs (n). Assuming a
new input every T interval and equi-probable distribution, the average delay
is t=(2n-1)*T, shown in the table for a 10Mhz operation
frequency. Partition may allow any number of inputs.
Extension to sequential designs
A XMR module with 2 sequential CUT always confines the error. Fault location
requires cells mirroring the memory elements, but the higher number of states
enlarges the detection intervals. Partition is a possible solution to get
sub-CUTs with no more than 18-20 inputs plus memory elements.
Wrong states must not return to valid states or a faulty CUT may
disable the right CUT.
Conclusions
The truth table to redesign the MUX, coherent with a TMR, increases the
cell hardware near 50%, with a signal path delay of 3 mandatory gates (2
in the std cell). The cell is tested in POST mode by reading the true
output on-line. Assuming that the BST infrastructure overheads near 5% [ 12], a XMR module may
have X<2.1 (IC hardware only).
Functional simulation test patterns are reused, enhancing the design
diversity provided through BST, and allow to detect permanent cmf
covered by their (single-) fault coverage [ 11].
The XMR-BST satisfies the objectives defined, is 1149.1 compliant
and may run in association with other F-T solutions.
Sequential circuits, reliable BS cell design and reliability features
of the BST chain are currently being addressed.
REFERENCES
1 - J.H Lala, R.E Harper, "Architectural Principles for Safety-Critical
Real-Time Applications", Proc. IEEE, V82 n1, Jan 1994, pp25-40.
2 - M. Nicolaidis, S. Noraz, B. Courtois, "A Generalized Theory
of Fail-Safe Systems", FTCS-19 Digest of Papers, IEEE Comp.
Society Press, 1989, pp398-406.
3 - P.K. Lala, Fault Tolerant, Fault Testable HW Design, Prentice/Hall
International, 1985.
4 - I. Koren, A.D. Singh, "Fault Tolerance in VLSI Circuits",
IEEE Computer, pp73-82,1990.
5 -M Lubaszewski, B Courtois,"On the Design of Self-Checking Boundary
Scannable Boards", Proc.ITC, 1992, IEEE,pp.372-81
6 - E Bohl, R Stephan, W Glauert,"The Architecture of the Fail-Stop
Controller AE11", IEEE IOLTW, Greece, 1997, pp.47-52.
7 -C Kuntzsch, F Mayer, K Ronge, "A Novel Approach for an On-Line
Selftest Architecture using ASIC Circuits in a Multi-Channel System",
IEEE IOLTW, Greece, 1997, pp165-8.
8 - IEEE Standard 1149.1 Test Access Port and Boundary-Scan Architecture,
IEEE Inc, NY, 1990.
9 - Texas Instruments IEEE 1149.1 Testability Primer, SSYA002B,
1994, http://www.ti.com/sc/docs/jtag/jtag2.htm
10 - V.D. Agrawal, "An Information Theoretic Approach to Digital
Testing", IEEE T. Computers, V.C-30,pp.582-7, 1981.
11 - J.V. Santos, J.M. Ferreira, "Failure Detection and Boundary
Scan: a Pseudo On-Line approach (POST)", 3rd IEEE IOLTW, Crete,
Greece, July 1997, pp160-4.
12 - A.L. Crouch, C. Pyron, "Impact of JTAG/1149.1 Testability on
Reliability", GOMAC 1989, pp83-90. |