Low-Cost Data Integrity Checking Schemes for Cache Memories
- Seongwoo Kim and Arun K. Somani 1
- Department of Electrical and Computer Engineering
- Iowa State University
Up to 60% of the chip area of recent microprocessors is dedicated to
caches and other memory latency-hiding hardware. Thus, the reliability
of the memory hierarchy significantly affects overall system dependability.
Soft (transient) errors have become a major problem in semiconductor memories
because of several factors: low signal power, fast data read/write operations,
greatly increased circuit density, and external disturbances such as noise,
power jitter, excessive heat, and ionization by alpha-particle hits. If
a soft error occurs in the cache and propagates to the processor, even a
single-bit error can result in a complete system failure. To confine soft
errors within the cache boundary and to eliminate the effects of the errors,
redundant codes such as parity and error-correcting codes (ECC) are widely
employed. In particular, single-error-correcting, double-error-detecting
(SEC-DED) codes are the most popular. The conventional configuration of such
a code-protected cache has the following problems:
- The current linear structure (i.e., every data unit of the cache is paired
with a protection code, either parity or ECC) incurs a high area overhead. For
example, the data RAM portion of a 32 KB cache needs a 4 KB parity array
for a byte-parity scheme (one parity bit per 8-bit chunk of data). The
redundancy requirement is directly proportional to the cache size. If the area
budget cannot meet this requirement, data integrity checking is often omitted
in low-end systems, even though the impact of system failures on ordinary
users is becoming less and less tolerable.
- Under an extremely low fault rate (almost error free, but not zero), which
results from the increasing quality of VLSI chips, protecting every data unit
through the linear structure can be excessive.
- For enhanced reliability, extending ECCs to higher capability within the
linear structure is impractical because of high cost and circuit complexity.
Duplication and N-modular redundancy with voting can be adopted, but they
are also very expensive.
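
The area figures in the list above follow from simple arithmetic. The sketch
below is illustrative only: the function names are hypothetical, and the
SEC-DED count assumes a standard extended Hamming code, which the text does
not specify.

```python
def parity_array_bytes(data_bytes: int, chunk_bits: int = 8) -> int:
    """Bytes of parity storage when one parity bit guards each chunk of data."""
    parity_bits = (data_bytes * 8) // chunk_bits
    return parity_bits // 8


def sec_ded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming SEC-DED code (assumed code family)."""
    r = 1
    # Hamming condition for single-error correction: 2**r >= data_bits + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one overall parity bit adds double-error detection


print(parity_array_bytes(32 * 1024))  # 4096 bytes, i.e. the 4 KB quoted above
print(sec_ded_check_bits(32))         # 7
print(sec_ded_check_bits(64))         # 8
```

In both cases the redundancy grows in direct proportion to the amount of
data protected, which is the cost the linear structure imposes.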
To remedy these problems, a low-cost, area-flexible configuration that
achieves acceptably high protection coverage is desirable. In [1], we
proposed an error detection scheme called shadow caching. A small shadow
cache is organized and managed so that it holds copies of frequently used
items in the non-protected cache and performs run-time comparison to detect
any number of erroneous bits in an item. We extend our previous work [1]
to develop new configuration schemes for code-protected caches that allow
a trade-off between silicon area and level of data integrity, giving the
system designer more options to choose from.
The basic idea of our schemes is to exploit access locality to configure
the protection code effectively. The conventional linear
structure is reasonable only under the assumption that each data unit has
the same probability of error occurrence. In practice, however, there are
often reasons for errors to be more common in some positions than others:
low noise margins can cause erroneous data change during read/write operations,
and thus cache lines that get more accesses have higher probability of corruption;
cross-coupling effects can induce errors in the adjacent locations of a
line that is being accessed; and the impact of external disturbance is not
always uniform over all locations.
Error propagation occurs only if a corrupted item is requested and it is
not covered by a sufficient protection code. Errors in lines that are not
accessed for a long time are likely to be overwritten by new data without
any impact on computation, whether or not protection exists. In general,
more than 20% to 50% of injected errors are removed automatically by data
writes and line replacements. One widely known program property is that only
10% of program instructions are responsible for 90% of execution. A similar
tendency can be observed in data segments. With a limited area budget, error
checking should start with this portion. Therefore, in our locality-based
configuration, any area available for data integrity checking is assigned
first to the most frequently used portions, on the assumption that the
most-often-used lines are the most likely to be erroneous.
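
To give a feel for how concentrated references can be, the sketch below
computes the share of references that fall on the most-used lines. It rests
on an assumption that is not in the text: the k-th most popular line is
referenced in proportion to 1/k (a Zipf-like profile). The function name and
parameters are hypothetical, and real workloads will differ.

```python
def top_share(num_lines: int, top_fraction: float) -> float:
    """Share of references going to the most-used top_fraction of lines,
    assuming line popularity falls off as 1/rank (illustrative assumption)."""
    weights = [1.0 / rank for rank in range(1, num_lines + 1)]
    top_count = max(1, int(num_lines * top_fraction))
    return sum(weights[:top_count]) / sum(weights)


# Share of references captured by the top 10% of a 512-line cache
print(round(top_share(512, 0.10), 2))
```

Under this assumed profile, a small fraction of lines receives well over
half of all references, which is the property the locality-based
configuration relies on.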
Under a very low soft error rate, instead of keeping a protection code
for every cache line, we maintain codes only for the most-often-used
portions in a small table called the parity cache (see Figure 1).
Any type of ECC can be used without algorithmic or structural
change. Because of reference locality, we expect that most of the time
the parity cache can provide the checking code for items requested from
the main cache.
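
A minimal sketch of the parity cache idea follows. It is not the authors'
design: the class and method names are hypothetical, the table is managed
with plain LRU replacement, a single even-parity bit per line stands in for
the check code, and a line with no entry is simply reported as unchecked and
then tracked. All of these are assumptions made for illustration.

```python
from collections import OrderedDict


def parity(data: bytes) -> int:
    """Even parity over all bits of a cache line (0 or 1)."""
    folded = 0
    for byte in data:
        folded ^= byte  # XOR-fold the line into one byte
    return bin(folded).count("1") & 1


class ParityCache:
    """Small LRU table holding check codes only for recently used lines."""

    def __init__(self, entries: int = 64):
        self.entries = entries
        self.codes = OrderedDict()  # line address -> stored parity bit

    def on_write(self, addr: int, data: bytes) -> None:
        """Record or refresh the code when a line is filled or written."""
        self.codes[addr] = parity(data)
        self.codes.move_to_end(addr)
        if len(self.codes) > self.entries:
            self.codes.popitem(last=False)  # evict the least recently used

    def on_read(self, addr: int, data: bytes) -> str:
        """Return 'ok', 'error', or 'unchecked' for a line read from the main cache."""
        if addr not in self.codes:
            # No code held for this line (assumed policy): start tracking it.
            self.on_write(addr, data)
            return "unchecked"
        self.codes.move_to_end(addr)
        return "ok" if self.codes[addr] == parity(data) else "error"


pc = ParityCache(entries=2)
line = bytes([0x12, 0x34, 0x56, 0x78])
corrupted = bytes([0x12, 0x34, 0x57, 0x78])  # one bit flipped
pc.on_write(0x100, line)
print(pc.on_read(0x100, line))       # ok
print(pc.on_read(0x100, corrupted))  # error
print(pc.on_read(0x200, line))       # unchecked
```

Swapping `parity` for a stronger code changes only the stored value, in
line with the remark that any type of ECC can be used without structural
change.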
Although a separate parity cache module incurs overheads (tag and status
bits, comparators, sense amplifiers, drivers, and control logic), its area
occupancy is still very low. Figure 2 shows the area of a 64-entry parity
cache relative to the ECC portion of a 16 KB conventional cache for different
associativities, based on the cache area model proposed in [2]. In our
simulation studies using software error injection, a parity cache occupying
38% less area than the conventional structure covers 98% of the injected
single-bit errors.
If applications require high data integrity, the capability of the protection
code must be increased. Instead of linear expansion (see Figure 3),
a cheaper approach is to enhance the code selectively. The parity
cache can be applied to a conventionally protected cache to provide
more intensive error checking for the frequently used portion. Alternatively,
for multi-way set-associative caches, a simple selective code
expansion is possible, as illustrated in Figure 4. The combination of the
primary and the extra protection codes is called enhanced protection.
The primary protection code is separable from the enhanced protection code
and, without the extra code, still provides protection with lower
capability. The extra code is built so that the most recently used
line in the set receives more intensive checking (i.e., more bits
of detection and correction) in conjunction with the primary protection code.
In our simulation studies, this scheme allows only 6.8% of multiple-bit errors
to propagate, at about 25% of the area cost of the linear expansion.
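
The selective expansion for set-associative caches can be sketched as
follows. This is an illustration under stated assumptions, not the scheme's
actual codes: one parity bit per line stands in for the primary code,
per-byte parity stands in for the extra code, and the class and function
names are hypothetical.

```python
def line_parity(data: bytes) -> int:
    """Primary code (stand-in): one even-parity bit over the whole line."""
    folded = 0
    for byte in data:
        folded ^= byte
    return bin(folded).count("1") & 1


def byte_parities(data: bytes) -> tuple:
    """Extra code (stand-in): one parity bit per byte of the line."""
    return tuple(bin(byte).count("1") & 1 for byte in data)


class ProtectedSet:
    """One cache set: every way has a primary code, only the MRU way has the extra code."""

    def __init__(self, ways: int):
        self.data = [None] * ways
        self.primary = [None] * ways
        self.mru_way = None
        self.extra = None  # extra code, valid for self.mru_way only

    def write(self, way: int, data: bytes) -> None:
        self.data[way] = data
        self.primary[way] = line_parity(data)
        self.mru_way, self.extra = way, byte_parities(data)

    def read(self, way: int) -> str:
        data = self.data[way]
        if line_parity(data) != self.primary[way]:
            return "error (primary)"
        if way == self.mru_way:
            ok = byte_parities(data) == self.extra
            return "ok (enhanced)" if ok else "error (extra)"
        # Line becomes MRU: rebuild the single extra code from its current data.
        self.mru_way, self.extra = way, byte_parities(data)
        return "ok (primary only)"


s = ProtectedSet(ways=2)
s.write(0, bytes([0x0F, 0xF0]))
s.write(1, bytes([0xAA, 0x55]))   # way 1 is now the MRU line
s.data[1] = bytes([0xAB, 0x54])   # flip one bit in each byte of way 1
print(s.read(1))                  # error (extra)
s.data[0] = bytes([0x0E, 0xF1])   # same kind of double-bit error in way 0
print(s.read(0))                  # ok (primary only)
```

The second read shows the trade-off: a double-bit error in a line that is
not the most recently used passes the primary code undetected, which is the
kind of residual propagation the selective scheme accepts in exchange for
storing the extra code once per set rather than once per line.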
The conventional linear structure of code-protected caches is not
area-efficient. The locality-based configuration scheme proposed in this
paper helps system designers adopt an error protection scheme that
satisfies both reliability and area requirements.
References
[1] A. K. Somani and S. Kim, "Transient fault
detection in cache memories by employing a small shadow cache," Proc.
6th Ann. Int'l Symp. Dependable Computing for Critical Applications,
pp. 17-38, Mar. 1997.
[2] M. Mulder, N. T. Quach, and M. J. Flynn, "An
area model for on-chip memories and its application," IEEE J. Solid-State
Circuits, vol. 26, no. 2, pp. 98-106, Feb. 1991.
Author contact: Dependable Computing Laboratory, Iowa State University,
Ames, IA 50011. Phone: 515-294-0442. Fax: 515-294-8432. E-mail: arun@iastate.edu.
This work was supported in part by National Foundation under Grant No. 9630058.