The Taxonomy of Design Faults in COTS Microprocessors

Algirdas Avizienis and Yutao He
Dependable Computing and Fault Tolerance Laboratory
UCLA Computer Science Department
University of California, Los Angeles, CA 90095-1596
E-mail: {aviz,yutao}@cs.ucla.edu

Introduction

The constantly increasing complexity of COTS microprocessors has led to a new kind of publication: a monthly ``update'' that lists specification changes and design faults found in the microprocessor since its initial release. For example, 47 design faults (``errata'') were reported for the Intel Pentium II between May 1997 and April 1998 [1]. For 25 of these 47 faults no ``fix'' is planned, i.e., they will remain in future versions (``steppings'') of the Pentium II.

The ``errata'' in [1] are listed in the chronological order of their discovery without any further classification. It is our goal to present a Design Fault Taxonomy (DFT) that classifies design faults according to a set of orthogonal relationships. The insights gained from this taxonomy should lead to three benefits: (1) the improvement of verification techniques; (2) effective utilization of such ``imperfect'' microprocessors in dependability-critical applications; and (3) refinement of design methodologies [2] in order to avoid similar design faults in future microprocessors.

 

The DFT Methodology

The DFT utilizes a set of p-design faults, a system model, a design fault model, a distribution of design faults in terms of the fault model, and a classification procedure.


2.1 Definition of p-Design Faults. The p-design faults in our study are defined as human-made faults that are created during the design of microprocessors but discovered only after the chips have been manufactured, i.e., during the post-silicon stage. The p-design faults are interesting because they have eluded extensive pre-silicon verification processes and their impacts are ``permanent'' in that they may be alleviated only partially with modification of firmware or software.


2.2 The System Model. At the top level of abstraction, a given microprocessor is divided into two logical parts: the Performance Delivery Architecture (PDA) and the Confidence-Assurance Architecture (CAA). Each part can be partitioned further into functional elements in a hierarchical manner. The purpose of the PDA is to deliver the desired performance. An example of an element of the PDA is the floating-point unit. The goal of the CAA is to assure the ability of the PDA to deliver the expected performance. The BIST logic is an example of an element of the CAA.


2.3 p-Design Fault Model. An effective classification scheme should be orthogonal, to eliminate ambiguity, and inferential, to disclose cause-effect relationships. The selected attributes are shown in Figure 1:

Figure 1: The p-Design Fault Model


Location: It is defined by the logic partitioning of a microprocessor and consists of the PDA, the CAA and their logic elements.

Type: It describes which aspect of the implementation deviates from the specification due to a p-design fault: (1) function, (2) timing, (3) data, and (4) control.

Triggering condition: It characterizes the operational environments under which a p-design fault is activated and includes: (1) configurations, (2) operation modes, and (3) triggering dependency.

Effect: It describes the effects (errors) caused by a p-design fault and includes (1) severity and (2) affected functions.
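The four attributes above can be read as the fields of a record. The following Python sketch is only an illustration of that reading: the class and field names (`PDesignFault`, `Location`, `FaultType`, `tally_by_type`) and the two sample records are assumptions introduced here, and the sample values are made up rather than taken from [1].

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Tuple


class Location(Enum):
    PDA = "Performance Delivery Architecture"
    CAA = "Confidence-Assurance Architecture"


class FaultType(Enum):
    FUNCTION = "function"
    TIMING = "timing"
    DATA = "data"
    CONTROL = "control"


@dataclass(frozen=True)
class PDesignFault:
    """One p-design fault described by the four attributes of the model."""
    fault_id: int                         # hypothetical identifier
    location: Location                    # Location: top-level partition
    element: str                          # Location: logic element within it
    fault_type: FaultType                 # Type: aspect deviating from the spec
    configurations: str                   # Triggering condition: configurations
    operation_modes: str                  # Triggering condition: operation modes
    single_event: bool                    # Triggering dependency: one event or several
    severity: str                         # Effect: severity
    affected_functions: Tuple[str, ...]   # Effect: affected functions


def tally_by_type(faults: Iterable[PDesignFault]) -> Counter:
    """Count faults per Type attribute."""
    return Counter(f.fault_type for f in faults)


# Made-up records, for illustration only.
sample = [
    PDesignFault(901, Location.PDA, "floating-point unit", FaultType.DATA,
                 "any", "any", False, "incorrect result", ("arithmetic",)),
    PDesignFault(902, Location.CAA, "BIST logic", FaultType.FUNCTION,
                 "any", "self-test", True, "denial of function", ("self-test",)),
]

type_counts = tally_by_type(sample)
```

Keeping each attribute in its own field mirrors the orthogonality requirement: a fault can be tallied by location, type, triggering condition, or effect independently of the other attributes.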



2.4 Classification Procedure. It consists of four steps: (1) Collection of data, (2) Acquisition of the system model, (3) Refinement of the p-design fault model, and (4) Analysis of p-design faults.

 

A Case Study: Design Faults in Intel Pentium II Microprocessors

An in-depth study of the 47 p-design faults documented for the Intel Pentium II microprocessor from May 1997 to April 1998 [1] has been conducted as a proof-of-concept example to illustrate the proposed taxonomy. The classification of the 47 faults by type is shown in Table 1.

 
Table 1: Classification by Type


\begin{tabular}{|c||c|c|c|c|} \hline\hline
{\b...
 ... \\ {\bf Fraction (\%)} & 14.9 & 6.4 & 19.1 & 59.6 \\ \hline\hline
\end{tabular}
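Because the fractions in Table 1 are taken over 47 faults, each percentage corresponds to a whole number of faults. The short Python sketch below recovers those counts in the column order of the table; the variable names are assumptions made for this illustration, and no column labels are implied.

```python
TOTAL_FAULTS = 47

# Fractions (%) from Table 1, in column order.
fractions_percent = [14.9, 6.4, 19.1, 59.6]

# Convert each percentage back to a whole number of faults.
counts = [round(p * TOTAL_FAULTS / 100) for p in fractions_percent]

# Sanity checks: the counts cover all 47 faults and reproduce the
# published percentages when rounded to one decimal place.
covers_all = sum(counts) == TOTAL_FAULTS
recomputed = [round(100 * c / TOTAL_FAULTS, 1) for c in counts]

print(counts, covers_all, recomputed)
```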


3.1 Where do p-design faults occur more frequently? Our study shows that 34 (over 72%) of the faults occur in the PDA, where over half of the faults cluster in complex components such as the memory management unit and the floating-point unit (FPU). In particular, among the five faults in the FPU, four lead to incorrect results and one causes a timing delay in exception reporting. Given that this FPU design has been used since the Pentium processor, it seems that verification techniques aimed at control-intensive components may not be as effective for arithmetic components, whose algorithmic features should be exploited.

It is also observed that only 13 (about 28%) of the faults occur in the CAA. This does not imply that the CAA is better verified than the PDA. On the contrary, when the relative size of the CAA (5%) is taken into account by defining the fault density as the number of faults per unit of relative size (size being measured by the number of transistors), the PDA has a fault density of 34/.95 = 36, while the CAA has a fault density of 13/.05 = 260, over 7 times that of the PDA.
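The fault-density comparison can be checked with a few lines of Python. This is a sketch of the arithmetic only; the dictionary and function names are assumptions introduced here, and the 95%/5% split is the relative size quoted in the text.

```python
# Fault counts and relative sizes (fraction of transistors) from the text.
fault_count = {"PDA": 34, "CAA": 13}
relative_size = {"PDA": 0.95, "CAA": 0.05}


def fault_density(part: str) -> float:
    """Faults per unit of relative size for the given part."""
    return fault_count[part] / relative_size[part]


pda_density = fault_density("PDA")   # about 36
caa_density = fault_density("CAA")   # about 260
ratio = caa_density / pda_density    # CAA density relative to PDA density

print(f"PDA: {pda_density:.1f}, CAA: {caa_density:.1f}, ratio: {ratio:.2f}")
```

The ratio comes out slightly above 7, which matches the statement that the CAA's fault density is over 7 times that of the PDA.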

In addition, the analysis of the triggering conditions of the 13 p-design faults in the CAA reveals that the majority of them are activated by a single event. For example, the BIST always signals ``fail'' regardless of whether the test actually passes or fails (No. 35 in [1]). These results indicate that verification of the CAA has not been as good as that of the PDA. We note that verification of the CAA requires exercising its functions under the expected abnormal conditions through fault injection, which is a difficult task.



3.2 How hard is it to activate a p-design fault? While the study shows that over half of the faults are activated by multiple events, about 42% of the faults are activated by a single event. For example, the fault in the L2 cache performance monitoring counter causes an incorrect count value (No. 23 in [1]). It is surprising that the verification processes have not been able to catch it.



3.3 Are p-design faults severe? The study shows that most design faults are non-trivial, with a large portion causing either a crash, a hang, or the denial of a function. In particular, about 49% of all design faults affect CAA-related functions; for example, the error detection function is affected by eleven faults. Furthermore, only some of the design faults are ``fixed'' (eliminated) in future ``steppings'' (modified versions) of the Pentium II processors. For over half of these CAA-related faults no fix has been planned thus far, which poses tremendous challenges for building COTS-based fault-tolerant systems. Consequently, the trustworthiness of these COTS microprocessors cannot be taken for granted and needs to be assessed carefully when they are used in a fault-tolerant system.

 

Conclusions and Future Work

Our study shows that the verification effort has not been able to keep up with the fast growth in complexity of modern COTS microprocessors. In particular, verification techniques for arithmetic components and for the CAA need attention. As a result, while solutions to the problem ``Can a COTS microprocessor deliver high performance?'' have been pursued relentlessly and with continuous progress, the problem ``Can a high-performance COTS microprocessor deliver high confidence?'' has been mostly ignored and remains challenging. Given this situation, any effective COTS-based fault tolerance technique needs to address the ``imperfection'' issue of COTS microprocessors.

Currently, research is under way to refine the DFT methodology by the study of p-design faults in Pentium and Pentium Pro microprocessors and to develop effective COTS-based fault tolerance techniques [3].

 

References

[1] Intel Corporation. Pentium II Processor Specification Update, April 8, 1998. Order No: 243337-013.
[2] Algirdas Avizienis. Toward systematic design of fault-tolerant systems. Computer, 30(4):51-58, April 1997.
[3] Algirdas Avizienis and Yutao He. The taxonomy of design faults in COTS microprocessors. Technical Report, UCLA Computer Science Dept., June 1998.