Software Implemented Fault Tolerance in Hypercube

(Experimental System)

D.R.Avresky and S. Geoghegan

                    Dept. of Electrical and Computer Eng.    Computer Science Dept.
                    Boston University                                     Texas A M University
                    8 St. Mary's Street                                   College Station, TX 77840
                    Boston, MA 02215
                    Phone:(617) 353-9850
                    Fax: (617) 353-6440
                    E-mail: avresky@bu.edu

Even though many researchers have tried to solve the problem of tolerating faulty nodes in hypercube computers, most, if not all, of the proposed solutions have been very costly to implement.  The approach taken here will be to use a spanning tree for reconfiguration
of hypercubes under faulty nodes. A spanning tree is a connected graph that spans the nodes of the graph, forming a tree with no cycles. Software Implemented Fault Tolerance (SIFT) in hypercubes is implemented by means of a software layer, written in each node of the nCube parallel computer, error detection application software and fast reconfiguration algorithm for avoiding faulty nodes.  The Balance Spanning Tree (BST) is used for embedding tree-based algorithms into the hypercube topology. Any single node fault in the hypercube can be tolerated by the software layer. More than 90 of the multiple faults can be tolerated without backtracking.  Each node periodically sends ``I am alive'' messages to its neighbors.  The faults in the application at each node are injected by means of the parallel version of the Software Object-Oriented Fault Injection Tool (nSOFIT).  The nSOFIT injects faults into the memory, data and address buses and registers of each node in the hypercube.  nSofit uses a set of primitives that will work under many operating systems and provide a virtual process image for a process, irrespective of the underlying RISC or CISC architecture. The set of primitives is presented in the C++ class structure. To test the portability of SOFIT into nCube, a study was conducted using a quadtree data compression algorithm.  In this algorithm, the fault detection mechanism is implemented by duplicating the memory used by program variables.  In this experiment, faults of 10 cycle duration were injected into the address bus, CPU, data bus, and memory. For these experiments, a confidence level and accuracy r = 0.05 were chosen.  The coverage factor was determined using nSOFIT to inject faults into node 7 as the quadtree algorithm was executed on a 16 node hypercube.  In this experiment, a 128x128 uncompressible array was used as input.

To test the reconfiguration algorithm, the quadtree data compression algorithm was written and executed on a BST embedded in a hypercube topology. Three separate test were run to measure various attributes of the quadtree algorithm.  First, the overhead incurred by the reconfiguration step was measured.  Next, the overhead incurred by the various fault-tolerance methods was measured.  Finally, the percentage of fault detected achieved by the fault tolerance methods was measured. As seen in the reconfiguration involves several nodes.  Any child, of the faulty node, must detect the node, compute an alternate parent, , and send , a reconfiguration message.  The node in turn must receive the reconfiguration message, add to its list of children, and send the reconfiguration message to.  Finally, receives the reconfiguration message, adds to its list of children.  As the program was executing, the node was forced to fail by the fault injection tool.  Three different overhead values are obtained since each of the three nodes involved in the reconfiguration process performs a different task.  As seen by the values in Table , the overhead incurred in each node is very small and does not vary with the size of the BST.

The next experiment was designed to determine the time overhead incurred by the fault tolerance methods.  In this example, two methods were chosen.  First, the ``I'm Alive'' is sent between the nodes. Second, each variable in the program is duplicated using C++ classes to redefine the basic data types. The results of these tests are summarized in Table. As can be seen in Table   the recomputation overhead decreases as the hybercube dimension increases. This is because the data is split into smaller partitions when more nodes are present.  The two contributing factors to the overhead incurred when tolerating a node fault are the reconfiguration and recomputation times. The data from these experiments show that the reconfiguration time remains constant and the recomputation time decreases as the size of the hypercube increases, i.e., the proposed algorithm is scalable.

A fault tolerance method implemented on a hypercube topology has been
discussed in this paper.  The technique is applicable to the set of parallel algorithms that use the divide and conquer approach and can be mapped onto a tree topology. The implemented software layer can be used for real-time parallel tree based applications,
when the time constrains should by met despite errors in the hypercube.