A Fault Model for System Area Networks
- Robert W. Horst
- Tandem, a Compaq Company
- 19333 Vallco Parkway
- Cupertino, CA 95014
- Email: Bob.Horst@tandem.com
System Area Networks address the need for high-speed, reliable communications
among processors and I/O devices in computing clusters. Designers of these
networks have taken different approaches to ensuring reliability, and it
is useful to have a framework for comparing these approaches.
SANs differ from other networks because they are designed to reduce
end-to-end latency aggressively. They typically employ wormhole or
cut-through routing, which allows a single packet to be spread across
multiple switches and links at any one time. With store-and-forward routing,
each switch can be treated as its own fault domain; with wormhole or
cut-through routing, that assumption is not reasonable. Many subtle fault
modes can affect the integrity of the entire routing fabric.
SAN network interfaces must arbitrate with the main CPU for access to
main memory. Interfaces using a standard bus, such as PCI, must be designed
to allow the same network adapter to work reliably with a variety of different
platforms. This creates other challenges, because each platform may exhibit
different behavior during faults, and this behavior is rarely well specified
or understood.
Many real-world faults can affect more than a single link in a routing
fabric. These faults may include power supply failures, switch failures,
faults affecting the ability of the network to access memory, and many
different kinds of design errors (bugs) in programming the data transfer,
error recovery, and management of the networks. A fault model needs to be
general enough to include all sources of network outage.
Figure 1 shows a fault model that attempts to incorporate most of the
errors encountered in real-world SANs. The left column shows the fault classification,
the next column maps each classification into a set of specific faults,
and the third column maps the specific faults into their manifestation in
the error state of a single routing fabric. The final column maps the fabric
errors into system errors. When the system is designed with multiple fabrics,
if errors are confined to a single fabric, each error may have no impact
on system-level integrity or availability. However, if there is only one
fabric, or if there are undetected or common-mode failures in a multi-fabric
system, the errors may show up at the system level as either data corruption
or an application failure caused by the inability to access critical resources.
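The last step of the model, from fabric errors to system errors, can be
sketched in a few lines of Python. This is a deliberately simplified
illustration, not the contents of Figure 1: the names FabricState,
SystemOutcome, and system_outcome are hypothetical, and the rule that any
undetected error leads to data corruption is an assumption made to keep the
example small.

```python
from enum import Enum


class FabricState(Enum):
    OK = "ok"
    DETECTED_ERROR = "detected error"      # fabric is known bad and can be avoided
    UNDETECTED_ERROR = "undetected error"  # fabric is bad but still trusted


class SystemOutcome(Enum):
    NO_IMPACT = "no impact"
    DATA_CORRUPTION = "data corruption"
    APPLICATION_HALT = "application halt"


def system_outcome(fabrics):
    """Map the states of all fabrics to one system-level outcome (assumed rules)."""
    fabrics = list(fabrics)
    if not fabrics:
        raise ValueError("at least one fabric is required")
    # Assumption: an undetected error on any fabric can silently corrupt data.
    if any(f is FabricState.UNDETECTED_ERROR for f in fabrics):
        return SystemOutcome.DATA_CORRUPTION
    # Every fabric is known bad: critical resources become unreachable.
    if all(f is FabricState.DETECTED_ERROR for f in fabrics):
        return SystemOutcome.APPLICATION_HALT
    # At least one healthy fabric remains and every bad one was detected.
    return SystemOutcome.NO_IMPACT
```

Under these assumptions, a detected error on one of two fabrics has no
system-level impact, while the same error on a single-fabric system, or a
common-mode failure that takes down both fabrics, ends in Application Halt.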
The model shows that simple graph-theory models of networks are likely
to overlook many of the possible error states. Richer fault models will
be more useful in assuring that proposed networks meet their reliability
goals. Networks can be enhanced to diminish or eliminate paths that end
in either the Data Corruption or Application Halt states.
To take a specific example, consider a network such as a 2-D mesh where
each node connects to four nearest neighbors. This network can tolerate
single link failures and switch failures. But to tolerate power supply failures,
the individual switches must be partitioned into power domains in such a way
that a single power fault cannot partition the fabric or remove critical
resources. Similarly, even if the packets through the network carry CRC
protection for the data fields, there must be other checks to prevent faults
in the packet headers or switches from delivering packets out of order or
to the wrong destination. The network must also be designed to prevent switch
faults and flow-control faults from causing deadlocks, dropped packets,
or mis-routed packets.
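The power-domain requirement above can be checked mechanically. The following
Python sketch is one possible way to do so, under stated assumptions: the mesh
has no wrap-around links, every switch belongs to exactly one power domain,
and a power fault removes all switches in its domain. The function names and
the two example assignments are hypothetical.

```python
from collections import deque


def mesh_links(rows, cols):
    """Return the undirected links of a rows x cols 2-D mesh (no wrap-around)."""
    links = set()
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:
                links.add(((r, c), (r + 1, c)))
            if c + 1 < cols:
                links.add(((r, c), (r, c + 1)))
    return links


def is_connected(nodes, links):
    """Breadth-first search: True if every surviving switch reaches every other."""
    nodes = set(nodes)
    if not nodes:
        return True
    adj = {n: [] for n in nodes}
    for a, b in links:
        if a in nodes and b in nodes:
            adj[a].append(b)
            adj[b].append(a)
    start = next(iter(nodes))
    seen = {start}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return seen == nodes


def domains_that_partition(rows, cols, power_domain):
    """List the power domains whose loss disconnects the surviving switches."""
    links = mesh_links(rows, cols)
    all_nodes = [(r, c) for r in range(rows) for c in range(cols)]
    bad = []
    for d in sorted(set(power_domain.values())):
        survivors = [n for n in all_nodes if power_domain[n] != d]
        if not is_connected(survivors, links):
            bad.append(d)
    return bad


# Hypothetical assignment 1: one power domain per row of a 3x3 mesh.
by_row = {(r, c): r for r in range(3) for c in range(3)}
# Hypothetical assignment 2: columns 0-1 in domain 0, column 2 in domain 1.
by_block = {(r, c): 0 if c < 2 else 1 for r in range(3) for c in range(3)}

print(domains_that_partition(3, 3, by_row))    # -> [1]
print(domains_that_partition(3, 3, by_block))  # -> []
```

With one domain per row, losing the middle row separates the top row from the
bottom row, so domain 1 is reported. The sketch checks only connectivity among
the surviving switches; it does not check whether the lost domain also held
critical resources.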
Dual-fabric systems must also be designed to avoid common-mode failures
that could take down both fabrics. The memory interface and fabric designs
are critical in this case, because faults in a single memory cannot be allowed
to cause infinite congestion on all fabrics, preventing any nodes from communicating.
The ServerNet SAN [1,2] has been specifically designed to address these
problems through a variety of self-checking techniques. The techniques include
duplicated and compared state machines, path disable hardware to avoid deadlocks,
timeout counters, and address validation firewalls. Two copies of a network
that is not fault tolerant are unlikely to provide the same degree of
reliability without changes to address all the potential common-mode error paths.
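As an illustration of the first technique in that list, the following Python
sketch shows the general duplicate-and-compare idea: two copies of a state
machine receive the same inputs, and any disagreement between their outputs is
flagged immediately. It is a generic software analogy, not a description of
the ServerNet hardware; the Counter machine and all class names are
hypothetical.

```python
class MiscompareError(Exception):
    """Raised when the two copies of the state machine disagree."""


class Counter:
    """Toy state machine (hypothetical): counts valid packets modulo 4."""

    def __init__(self):
        self.state = 0

    def step(self, packet_valid):
        if packet_valid:
            self.state = (self.state + 1) % 4
        return self.state


class SelfChecking:
    """Runs two copies in lockstep and compares their outputs on every step."""

    def __init__(self, make_machine):
        self.primary = make_machine()
        self.shadow = make_machine()

    def step(self, *inputs):
        out_a = self.primary.step(*inputs)
        out_b = self.shadow.step(*inputs)
        if out_a != out_b:
            # Stop before the faulty output can propagate any further.
            raise MiscompareError(f"outputs differ: {out_a!r} != {out_b!r}")
        return out_a
```

The point of the design is that a fault in either copy is detected at the step
where the outputs first diverge, rather than later as corrupted data
downstream.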
The fault model described in this paper can be used as a framework to
examine many types of network fault tolerance. It may help to provide guidance
to designers and to groups doing fault-injection and verification of System
Area Networks.
References
[1] R. W. Horst, "TNet: A Reliable System Area Network," IEEE Micro, Vol. 15, No. 1, pp. 37-45, February
1995.
[2] W. E. Baker, R. W. Horst, D. P. Sonnier, W. J. Watson, "A Flexible
ServerNet-based Fault-Tolerant Architecture," in Proc. 25th Int. Symp. Fault-Tolerant Computing,
Pasadena, CA, June 27-30, 1995.