Low-Cost Data Integrity Checking Schemes for Cache Memories

Seongwoo Kim and Arun K. Somani 1
Department of Electrical and Computer Engineering
Iowa State University

Up to 60% of the chip area of recent microprocessors is dedicated to caches and other memory latency-hiding hardware. Thus, the reliability of the memory hierarchy significantly affects overall system dependability. Soft (transient) errors have become a major problem in semiconductor memories due to several factors: low signal power, fast data read/write operations, greatly increased circuit density, and external disturbances such as noise, power jitter, excessive heat, and ionization by alpha-particle hits. If a soft error occurs in the cache and propagates to the processor, even a single-bit error can result in a complete system failure. To confine soft errors within the cache boundary and to eliminate their effects, redundant codes such as parity and error-correcting codes (ECC) are widely employed. In particular, single-error-correcting, double-error-detecting (SEC-DED) codes are the most popular. The conventional configuration of such a code-protected cache has the following problems:

  • The current linear structure (i.e., every data unit of the cache is paired with a protection code, either parity or ECC) requires high area overhead. For example, the data RAM portion of a 32 KB cache needs a 4 KB parity array under a byte-parity scheme (one parity bit per 8 bits of data). The redundancy requirement is directly proportional to the cache size. When the area budget cannot meet this requirement, data integrity checking is often omitted in low-end systems, even though the impact of system failures on ordinary users is becoming less tolerable.
  • Under an extremely low fault rate (almost error free, but not zero), which results from the increasing quality of VLSI chips, the linear structure of the protection code can be excessive.
  • For enhanced reliability, extending ECCs to higher capability within the linear structure is not practical due to high cost and circuit complexity. Duplication and N-modular redundancy with voting can be adopted, but they are also very expensive.
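
The overhead figures in the list above can be checked with a short calculation. The following Python sketch is illustrative only: the function names and the 64-bit word size used for SEC-DED are assumptions, not taken from the paper. The check-bit count uses the standard Hamming condition (smallest r with 2^r >= k + r + 1) plus one overall parity bit for double-error detection.

```python
def parity_overhead_bytes(cache_bytes: int, data_bits_per_parity_bit: int = 8) -> int:
    """Bytes of parity storage needed for a data array of cache_bytes bytes."""
    data_bits = cache_bytes * 8
    parity_bits = data_bits // data_bits_per_parity_bit
    return parity_bits // 8


def sec_ded_check_bits(data_bits: int) -> int:
    """Check bits of an extended Hamming (SEC-DED) code over data_bits bits."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit gives double-error detection


def sec_ded_overhead_bytes(cache_bytes: int, word_bits: int = 64) -> int:
    """Bytes of SEC-DED storage, assuming one code word per word_bits of data."""
    words = cache_bytes * 8 // word_bits
    return words * sec_ded_check_bits(word_bits) // 8


if __name__ == "__main__":
    print(parity_overhead_bytes(32 * 1024))   # 4096 bytes (4 KB), as in the text
    print(sec_ded_check_bits(64))             # 8 check bits per 64-bit word
    print(sec_ded_overhead_bytes(32 * 1024))  # 4096 bytes under the 64-bit assumption
```

Both results grow in direct proportion to the cache size, which is the linear cost the list describes.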

To remedy these problems, a low-cost and area-flexible configuration that achieves acceptably high protection coverage is desirable. In [1], we proposed an error detection scheme called shadow caching. A small shadow cache is organized and managed so that it contains a copy of frequently used items in the non-protected cache and performs run-time comparison to detect any number of erroneous bits in an item. We extend our previous work [1] to develop new configuration schemes for code-protected caches that allow a trade-off between silicon area and level of data integrity, giving the system designer more options to choose from.

The basic idea of our schemes is to exploit access locality to configure the protection codes effectively. The conventional linear structure is reasonable only under the assumption that each data unit has the same probability of error occurrence. In practice, however, there are often reasons for errors to be more common in some positions than in others: low noise margins can cause erroneous data changes during read/write operations, so cache lines that are accessed more often have a higher probability of corruption; cross-coupling effects can induce errors in locations adjacent to a line that is being accessed; and the impact of external disturbances is not always uniform over all locations.

Error propagation occurs only if a corrupted item is requested and it is not covered by a sufficient protection code. Errors in lines that are not accessed for a long time have a high probability of being overwritten by new data without any impact on computation, irrespective of whether protection exists. In general, 20% to 50% or more of injected errors are removed automatically by data writes and line replacement. A widely known program property is that only 10% of program instructions are responsible for 90% of execution time. A similar tendency can be observed in the data segments. With a limited area budget, error checking should start from this portion. Therefore, in our locality-based configuration, any available area for data integrity checking is assigned first to the most frequently used portions, on the assumption that the most-often-used lines are the most likely to be erroneous.
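
The locality argument above can be illustrated with a toy measurement. The sketch below counts how many accesses in a trace fall on lines already tracked by a small table. The trace is synthetic and hypothetical (a four-line hot loop interleaved with lines that are touched once), and the LRU policy and all names are assumptions made for illustration, not the paper's simulation setup.

```python
from collections import OrderedDict


def checked_fraction(trace, entries):
    """Fraction of accesses whose line is already held in an LRU table."""
    table = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in table:
            hits += 1
            table.move_to_end(addr)        # mark as most recently used
        else:
            table[addr] = True
            if len(table) > entries:
                table.popitem(last=False)  # evict the least recently used line
    return hits / len(trace)


def synthetic_trace(iterations=100):
    """Hypothetical trace: four hot lines per iteration plus one cold line."""
    trace = []
    for i in range(iterations):
        trace += [0, 1, 2, 3]   # hot loop lines
        trace.append(100 + i)   # a line that is never reused
    return trace


if __name__ == "__main__":
    print(checked_fraction(synthetic_trace(), entries=8))
```

On this trace an eight-entry table tracks every hot-line access after the first iteration, while the one-time lines are never covered. Real workloads are less regular, so the covered fraction has to be measured per application.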

Under a very low soft error rate, instead of keeping a protection code for every cache line, we maintain the code only for the most-often-used portions in a small table, called the parity cache (see Figure 1). Any type of ECC can be used without any algorithmic or structural change. Owing to reference locality, we expect that the parity cache can provide the checking code for requested items in the main cache most of the time.
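
A minimal software sketch of the parity cache idea follows. It is not the hardware organization of Figure 1: the LRU replacement policy, the byte-parity code, the 64-entry default, and the names `ParityCache`, `on_write`, and `on_read` are all assumptions chosen for illustration.

```python
from collections import OrderedDict


def byte_parity(data: bytes) -> tuple:
    """One even-parity bit per byte of the line."""
    return tuple(bin(b).count("1") & 1 for b in data)


class ParityCache:
    """Small table holding check codes only for recently used cache lines."""

    def __init__(self, entries: int = 64):
        self.entries = entries
        self.table = OrderedDict()  # line address -> parity bits

    def on_write(self, addr: int, data: bytes) -> None:
        """Record the code when a line is filled or written."""
        self.table[addr] = byte_parity(data)
        self.table.move_to_end(addr)
        if len(self.table) > self.entries:
            self.table.popitem(last=False)  # assumed policy: evict LRU entry

    def on_read(self, addr: int, data: bytes) -> str:
        """Return 'ok', 'error', or 'unchecked' for the data being read."""
        code = self.table.get(addr)
        if code is None:
            return "unchecked"  # no code is kept for this line
        self.table.move_to_end(addr)
        return "ok" if code == byte_parity(data) else "error"


if __name__ == "__main__":
    pc = ParityCache(entries=2)
    pc.on_write(0x100, b"\x0f\xf0")
    print(pc.on_read(0x100, b"\x0f\xf0"))  # data unchanged
    print(pc.on_read(0x100, b"\x0e\xf0"))  # one flipped bit in the first byte
```

A read that misses in the table is passed on unchecked, which is the coverage given up in exchange for the smaller area. Swapping `byte_parity` for a stronger code would leave the table logic unchanged.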

Although a separate parity cache module incurs overheads (tag and status bits, comparators, sense amplifiers, drivers, and control logic), its area occupancy is still very low. Figure 2 shows the relative area ratios of a 64-entry parity cache to the ECC portion of a 16 KB normal cache with different associativities, based on the cache area model proposed in [2]. In our simulation studies using software error injection, 98% of injected single-bit errors are covered by a parity cache that occupies 38% less area than the conventional structure.

If applications require high data integrity, the capability of the protection codes needs to be increased. Instead of linear expansion (see Figure 3), a cheaper approach is to enhance the code selectively. The parity cache can be applied to a conventionally protected cache to provide more intensive error checking for the frequently used portion. Alternatively, in multi-way set-associative caches, a simple selective code expansion is possible, as illustrated in Figure 4. The combination of the primary and the extra protection codes is called enhanced protection. The primary protection code is separable from the enhanced protection code, so it can still provide protection, with less capability, without the extra code. The extra code is built so that the most recently used line in the set is protected by more intensive checking (i.e., more bits of detection and correction) in conjunction with the primary protection code. In our simulation studies, this scheme allows only 6.8% of multiple-bit errors to propagate at about 25% of the area cost of the linear expansion.
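
The selective expansion can be sketched as one extra-code slot per set that follows the most recently used line. The model below is a simplification under stated assumptions: the class and function names are hypothetical, the extra code itself is not computed, and the area estimate ignores the pointer and control logic needed to track which way the slot covers.

```python
class ProtectedSet:
    """One cache set: every way has a primary code, one way has the extra code."""

    def __init__(self, ways: int):
        self.ways = ways
        self.enhanced_way = None  # way currently covered by the extra-code slot

    def access(self, way: int) -> None:
        """On an access, the extra code is regenerated for the MRU line."""
        if not 0 <= way < self.ways:
            raise ValueError("way out of range")
        self.enhanced_way = way

    def protection(self, way: int) -> str:
        """Protection level currently available for a given way."""
        return "enhanced" if way == self.enhanced_way else "primary"


def linear_extra_bits(sets: int, ways: int, extra_bits: int) -> int:
    """Extra storage when every line receives the extra code."""
    return sets * ways * extra_bits


def selective_extra_bits(sets: int, extra_bits: int) -> int:
    """Extra storage when each set has a single extra-code slot."""
    return sets * extra_bits


if __name__ == "__main__":
    s = ProtectedSet(ways=4)
    s.access(2)
    print(s.protection(2), s.protection(0))
    print(selective_extra_bits(128, 8) / linear_extra_bits(128, 4, 8))
```

In this simplified model the storage ratio is 1/ways, so a 4-way cache would need one quarter of the extra bits of linear expansion. The set count and extra-code width in the usage lines are arbitrary example values.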

The conventional linear structure of code-protected caches is not area-effective. The locality-based configuration schemes proposed in this paper help the system designer adopt an error protection scheme that satisfies both reliability and area requirements.

References

[1] A. K. Somani and S. Kim, "Transient fault detection in cache memories by employing a small shadow cache," Proc. 6th Ann. Int'l Symp. Dependable Computing for Critical Applications, pp. 17-38, Mar. 1997.

[2] M. Mulder, N. T. Quach, and M. J. Flynn, "An area model for on-chip memories and its application," IEEE J. Solid-State Circuits, vol. 26, no. 2, pp. 98-106, Feb. 1991.


1. Author contact: Dependable Computing Laboratory, Iowa State University, Ames, IA 50011. Phone: 515-294-0442. Fax: 515-294-8432. E-mail: arun@iastate.edu. This work was supported in part by the National Science Foundation under Grant No. 9630058.