Location: It is defined by the logic partitioning of a microprocessor and consists of the PDA, the CAA and their logic elements.
Type: It describes which aspect of the implementation deviates from the specification due to a p-design fault: (1) function, (2) timing, (3) data, and (4) control.
Triggering condition: It characterizes the operational environments under which a p-design fault is activated and includes: (1) configurations, (2) operation modes, and (3) triggering dependency.
Effect: It describes the effects (errors) caused by a p-design fault and includes (1) severity and (2) affected functions.
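The four attributes above can be read as the fields of a record. The following is a minimal sketch of such a record, not the authors' tooling: the class and field names, the `event_count` stand-in for "triggering dependency", and the example record are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Location(Enum):
    PDA = "PDA"
    CAA = "CAA"


class FaultType(Enum):
    FUNCTION = "function"
    TIMING = "timing"
    DATA = "data"
    CONTROL = "control"


@dataclass(frozen=True)
class TriggeringCondition:
    configurations: Tuple[str, ...]
    operation_modes: Tuple[str, ...]
    event_count: int  # hypothetical stand-in for "triggering dependency"


@dataclass(frozen=True)
class Effect:
    severity: str
    affected_functions: Tuple[str, ...]


@dataclass(frozen=True)
class PDesignFault:
    location: Location
    logic_element: str
    fault_type: FaultType
    trigger: TriggeringCondition
    effect: Effect


def count_by_type(faults: Iterable[PDesignFault]) -> Counter:
    """Tally faults per type, as a by-type classification table would."""
    return Counter(f.fault_type for f in faults)


# Made-up record for illustration only; it is not taken from the errata list [1].
example = PDesignFault(
    location=Location.CAA,
    logic_element="hypothetical checker",
    fault_type=FaultType.CONTROL,
    trigger=TriggeringCondition(("any",), ("normal",), event_count=1),
    effect=Effect(severity="denial of a function",
                  affected_functions=("error detection",)),
)
```

Keeping each attribute as a separate field allows the same set of records to be tallied by location, type, triggering condition, or effect.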
2.4 Classification Procedure. It consists of four steps: (1) Collection
of data, (2) Acquisition of the system model, (3) Refinement
of the p-design fault model, and (4) Analysis of p-design faults.
An in-depth study of the 47 p-design faults documented for the Intel
Pentium II microprocessor from May 1997 to April 1998 [1]
has been conducted as a proof-of-concept example to illustrate the proposed
taxonomy. The classification of the 47 faults by type is shown in Table 1.
3.1 Where do p-design faults occur more frequently? Our study shows
that 34 (over 72%) of the faults occur in the PDA, where over half of the faults
cluster in complex components such as the memory management unit and the
floating-point unit (FPU). In particular, among the five faults in the FPU, four lead to incorrect
results and one causes a timing delay in exception reporting. Given that
this FPU design has been used since the Pentium processor, it seems that
the verification techniques aimed at control-intensive components may not
be as effective for arithmetic components, whose algorithmic features should
be exploited.
It is also observed that only 13 (about 28%) of the faults occur in the CAA. This does not imply that the CAA is better verified than the PDA. On the contrary, when the relative size of the CAA (5%) is taken into account by defining the fault density as the number of faults divided by the relative size of the area (measured by its share of transistors), the PDA has a fault density of 34/0.95, or about 36, while the CAA has a fault density of 13/0.05 = 260, over 7 times that of the PDA.
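The fault-density comparison can be checked with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in the text (34 and 13 faults, relative sizes of 95% and 5%); the function name `fault_density` is an assumption made for illustration.

```python
def fault_density(fault_count: int, relative_size: float) -> float:
    """Faults divided by the area's relative size (a fraction in (0, 1])."""
    if not 0 < relative_size <= 1:
        raise ValueError("relative_size must be in (0, 1]")
    return fault_count / relative_size


pda_density = fault_density(34, 0.95)  # PDA: 34 faults, 95% of the design
caa_density = fault_density(13, 0.05)  # CAA: 13 faults, 5% of the design
ratio = caa_density / pda_density      # CAA density relative to PDA density

print(f"PDA: {pda_density:.1f}  CAA: {caa_density:.1f}  ratio: {ratio:.2f}")
```

Although the CAA has fewer faults in absolute terms, the ratio comes out slightly above 7 once relative size is taken into account.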
In addition, the analysis of the triggering conditions of the 13 p-design faults in the CAA reveals that the majority of them are activated by a single event. For example, the BIST always signals ``fail'' regardless of whether the test actually passes or fails (No. 35 in [1]). The results indicate that verification of the CAA has not been as good as that of the PDA. We note that verifying the CAA requires exercising its functions under the expected abnormal conditions through fault injection, which is a difficult task.
3.2 How hard is it to activate a p-design fault? While the study
shows that over half of the faults are activated by multiple events, about
42% of the faults are activated by a single event. For example, the fault
in the L2 cache performance monitoring counter causes an incorrect count
value (No. 23 in [1]). It is surprising
that the verification processes have not been able to catch it.
3.3 Are p-design faults severe? The study shows that most design
faults are non-trivial, with a large portion causing either a crash, a hang,
or the denial of a function. In particular, about 49% of all design faults
affect CAA-related functions; for example, the error detection function
is affected by eleven faults. Furthermore, only some of the design faults
are ``fixed'' (eliminated) in future ``steppings'' (modified versions) of
the Pentium II processors. Over half of these CAA-related faults are not
yet planned to be fixed, which poses tremendous challenges for
building COTS-based fault-tolerant systems. Consequently, the trustworthiness
of these COTS microprocessors cannot be taken for granted and needs to be
assessed carefully when they are used in a fault-tolerant system.
Our study shows that the verification effort has not been able to keep up with the fast growth in complexity of modern COTS microprocessors. In particular, verification techniques for arithmetic components and for the CAA need attention. As a result, while solutions to the problem ``Can a COTS microprocessor deliver high performance?'' have been pursued relentlessly and continuous progress has been made, the problem ``Can a high-performance COTS microprocessor deliver high confidence?'' has been mostly ignored and remains challenging. Given the situation, any effective COTS-based fault tolerance technique needs to address the ``imperfection'' issue of COTS microprocessors.
Currently, research is under way to refine the DFT methodology through the study of p-design faults in the Pentium and Pentium Pro microprocessors and to develop effective COTS-based fault tolerance techniques [3].