A Fault Model for System Area Networks

Robert W. Horst
Tandem, a Compaq Company
19333 Vallco Parkway
Cupertino, CA 95014
Email: Bob.Horst@tandem.com

 

System Area Networks (SANs) address the need for high-speed, reliable communication among processors and I/O devices in computing clusters. Designers of these networks have taken different approaches to ensuring reliability, and it is useful to have a framework for comparing these approaches.

SANs differ from other networks in that they are designed to reduce end-to-end latency as aggressively as possible. They typically employ wormhole or cut-through routing, which allows a single packet to be spread across multiple switches and links at any one time. Unlike in a store-and-forward network, it is therefore not reasonable to assume that each switch sits alone in its own fault domain. Many subtle fault modes can affect the integrity of the entire routing fabric.
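The following sketch illustrates why a wormhole-routed packet crosses fault domains. It is a simplified model under stated assumptions, not a description of any particular SAN: the packet is measured in flits, every switch on the path buffers the same number of flits, and the names packet_flits, buffer_flits_per_switch, and path_hops are hypothetical parameters introduced here.

```python
import math

def switches_occupied(packet_flits, buffer_flits_per_switch, path_hops, wormhole=True):
    """Upper bound on how many switches hold part of one packet at the same time.

    Assumptions (illustrative only): uniform buffer depth per switch, and for
    store-and-forward the link transfer itself is ignored, so the packet is
    counted as residing in one switch at a time.
    """
    if not wormhole:
        # Store-and-forward: the whole packet is buffered in one switch
        # before it is forwarded to the next.
        return 1
    # Wormhole / cut-through: the packet stretches back along the path,
    # filling one buffer per switch until it runs out of flits or hops.
    return min(path_hops, math.ceil(packet_flits / buffer_flits_per_switch))

# A 64-flit packet with 8-flit buffers can span every switch on a 5-hop path.
print(switches_occupied(64, 8, 5))                  # wormhole
print(switches_occupied(64, 8, 5, wormhole=False))  # store-and-forward
```

Under this model, a fault in any one of the occupied switches can damage or stall the same packet, which is why a single switch cannot be treated as an isolated fault domain.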

SAN network interfaces must arbitrate with the main CPU for access to main memory. Interfaces using a standard bus, such as PCI, must be designed to allow the same network adapter to work reliably with a variety of different platforms. This creates other challenges, because each platform may exhibit different behavior during faults, and this behavior is rarely well specified or understood.

Many real-world faults affect more than a single link in a routing fabric. They include power supply failures, switch failures, faults that impair the network's ability to access memory, and many kinds of design errors (bugs) in the programming of data transfer, error recovery, and network management. A fault model needs to be general enough to include all sources of network outage.

Figure 1 shows a fault model that attempts to incorporate most of the errors encountered in real-world SANs. The left column shows the fault classification, the second column maps each classification into a set of specific faults, and the third column maps the specific faults into their manifestation as error states of a single routing fabric. The final column maps the fabric errors into system errors. When the system is designed with multiple fabrics and errors are confined to a single fabric, an error may have no impact on system-level integrity or availability. However, if there is only one fabric, or if a multi-fabric system suffers undetected or common-mode failures, the errors may show up at the system level as either data corruption or an application failure caused by the inability to access critical resources.
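The mapping from fabric errors to system errors can be sketched as a small decision function. This is a deliberately coarse reading of the paragraph above, not a transcription of Figure 1: the function name, its four parameters, and the rule that an undetected corrupting error always reaches the application are assumptions made for illustration.

```python
from enum import Enum

class SystemOutcome(Enum):
    NO_IMPACT = "no impact"
    DATA_CORRUPTION = "data corruption"
    APPLICATION_HALT = "application halt"

def system_outcome(fabrics_total, fabrics_in_error, detected, corrupts_data):
    """Map one fabric-level error to a system-level outcome (hypothetical rule set)."""
    healthy_fabric_remains = fabrics_in_error < fabrics_total
    if detected and healthy_fabric_remains:
        # The error is confined and known, so traffic can use another fabric.
        return SystemOutcome.NO_IMPACT
    if corrupts_data and not detected:
        # Bad data is delivered silently.
        return SystemOutcome.DATA_CORRUPTION
    # Either no usable fabric is left (single fabric or common-mode failure),
    # or an undetected loss leaves the application unable to reach resources.
    return SystemOutcome.APPLICATION_HALT

# Detected error on one of two fabrics versus the same error on a lone fabric.
print(system_outcome(2, 1, detected=True, corrupts_data=False))
print(system_outcome(1, 1, detected=True, corrupts_data=False))
```

The sketch makes the two escape routes explicit: a second fabric helps only when the error is detected and does not also affect the other fabric.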

The model shows that simple graph-theory models of networks are likely to overlook many of the possible error states. Richer fault models will be more useful in assuring that proposed networks meet their reliability goals. Networks can be enhanced to diminish or eliminate paths that end in either the Data Corruption or Application Halt states.

To take a specific example, consider a network such as a 2-D mesh in which each node connects to its four nearest neighbors. This network can tolerate single link failures and switch failures. But to tolerate power supply failures, the switches must be partitioned into power domains so that a single power fault cannot partition the fabric or remove critical resources. Similarly, even if packets carry CRC protection for the data fields, other checks are needed to prevent faults in the packet headers or switches from delivering packets out of order or to the wrong destination. The network must also be designed to prevent switch faults and flow-control faults from causing deadlocks, dropped packets, or misrouted packets.
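The power-domain requirement can be checked mechanically. The sketch below is one possible check, assuming a mesh without wraparound links (so edge switches have fewer than four neighbors) and a hypothetical power_domain map from switch coordinates to a domain label. It tests only whether losing a domain partitions the surviving switches; it does not model the loss of critical resources.

```python
from collections import deque

def mesh_links(rows, cols):
    """Adjacency map for a rows x cols 2-D mesh of switches (no wraparound)."""
    adj = {(r, c): set() for r in range(rows) for c in range(cols)}
    for (r, c) in adj:
        for dr, dc in ((0, 1), (1, 0)):
            nbr = (r + dr, c + dc)
            if nbr in adj:
                adj[(r, c)].add(nbr)
                adj[nbr].add((r, c))
    return adj

def is_connected(adj, alive):
    """Breadth-first search restricted to the switches that still have power."""
    if not alive:
        return True
    start = next(iter(alive))
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr in alive and nbr not in seen:
                seen.add(nbr)
                queue.append(nbr)
    return len(seen) == len(alive)

def partitioning_domains(adj, power_domain):
    """Return the power domains whose failure splits the surviving switches."""
    bad = []
    for domain in sorted(set(power_domain.values())):
        alive = {s for s in adj if power_domain[s] != domain}
        if not is_connected(adj, alive):
            bad.append(domain)
    return bad

adj = mesh_links(3, 3)
by_column = {(r, c): c for (r, c) in adj}           # one supply per column
per_switch = {(r, c): r * 3 + c for (r, c) in adj}  # one supply per switch
print(partitioning_domains(adj, by_column))
print(partitioning_domains(adj, per_switch))
```

In the 3 x 3 example, powering each column from one supply makes the middle column a single point of partition, even though the same mesh survives the loss of any individual switch. A link-and-switch graph model alone would not reveal this.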

Dual-fabric systems must also be designed to avoid common-mode failures that could take down both fabrics. The memory interface and fabric designs are critical in this case, because a fault in a single memory cannot be allowed to cause unbounded congestion on all fabrics and so prevent every node from communicating. The ServerNet SAN [1,2] has been specifically designed to address these problems through a variety of self-checking techniques. These include duplicated and compared state machines, path-disable hardware to avoid deadlocks, timeout counters, and address-validation firewalls. Two copies of a non-fault-tolerant network are not likely to provide the same degree of reliability unless changes are made to address all the potential common-mode error paths.
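As a generic illustration of one technique named above, duplicated and compared state machines, the sketch below runs two copies of a toy state machine in lockstep and halts on the first disagreement. It is not ServerNet's implementation: the Counter machine, the SelfCheckingPair wrapper, and MismatchError are hypothetical names, and a software model can only approximate what is done in hardware.

```python
class MismatchError(Exception):
    """Raised when the two copies of the state machine disagree."""

class Counter:
    """Toy state machine: counts accepted flits modulo 4 (illustrative only)."""
    def __init__(self):
        self.state = 0

    def step(self, accept):
        if accept:
            self.state = (self.state + 1) % 4
        return self.state

class SelfCheckingPair:
    """Drive two identical machines with the same input and compare outputs."""
    def __init__(self, factory):
        self.primary = factory()
        self.shadow = factory()
        self.halted = False

    def step(self, value):
        if self.halted:
            raise MismatchError("pair already halted after a mismatch")
        out_primary = self.primary.step(value)
        out_shadow = self.shadow.step(value)
        if out_primary != out_shadow:
            # Fail fast: stop before a wrong output can propagate.
            self.halted = True
            raise MismatchError(f"outputs differ: {out_primary} != {out_shadow}")
        return out_primary

pair = SelfCheckingPair(Counter)
pair.step(True)
pair.shadow.state = 3  # simulate a fault that corrupts one copy's state
try:
    pair.step(True)
except MismatchError as err:
    print("fault detected:", err)
```

The design choice is that a faulty component stops rather than emitting an unchecked result, which keeps an error from silently reaching the rest of the fabric.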

The fault model described in this paper can be used as a framework for examining many types of network fault tolerance. It may help guide designers, as well as groups performing fault injection and verification of System Area Networks.

References

[1] R. W. Horst, "TNet: A Reliable System Area Network," IEEE Micro, Vol. 15, No. 1, pp. 37-45, February 1995.

[2] W. E. Baker, R. W. Horst, D. P. Sonnier, W. J. Watson, "A Flexible ServerNet-based Fault-Tolerant Architecture," in Proc. 25th Int. Symp. Fault-Tolerant Computing, Pasadena, CA, June 27-30, 1995.