Software Implemented Fault Tolerance in Hypercube
(Experimental System)
D.R.Avresky and S. Geoghegan
Dept. of Electrical and Computer Eng. Computer Science
Dept.
Boston University
Texas A M University
8 St. Mary's Street
College Station, TX 77840
Boston, MA 02215
Phone:(617) 353-9850
Fax: (617) 353-6440
E-mail: avresky@bu.edu
Even though many researchers have tried to solve the problem of tolerating
faulty nodes in hypercube computers, most, if not all, of the proposed solutions
have been very costly to implement. The approach taken here will be
to use a spanning tree for reconfiguration
of hypercubes under faulty nodes. A spanning tree is a connected graph that
spans the nodes of the graph, forming a tree with no cycles. Software Implemented
Fault Tolerance (SIFT) in hypercubes is implemented by means of a software
layer, written in each node of the nCube parallel computer, error detection
application software and fast reconfiguration algorithm for avoiding faulty
nodes. The Balance Spanning Tree (BST) is used for embedding tree-based
algorithms into the hypercube topology. Any single node fault in the hypercube
can be tolerated by the software layer. More than 90 of the multiple faults
can be tolerated without backtracking. Each node periodically sends
``I am alive'' messages to its neighbors. The faults in the application
at each node are injected by means of the parallel version of the Software
Object-Oriented Fault Injection Tool (nSOFIT). The nSOFIT injects
faults into the memory, data and address buses and registers of each node
in the hypercube. nSofit uses a set of primitives that will work under
many operating systems and provide a virtual process image for a process,
irrespective of the underlying RISC or CISC architecture. The set of primitives
is presented in the C++ class structure. To test the portability of SOFIT
into nCube, a study was conducted using a quadtree data compression algorithm.
In this algorithm, the fault detection mechanism is implemented by duplicating
the memory used by program variables. In this experiment, faults of
10 cycle duration were injected into the address bus, CPU, data bus, and
memory. For these experiments, a confidence level and accuracy r = 0.05
were chosen. The coverage factor was determined using nSOFIT to inject
faults into node 7 as the quadtree algorithm was executed on a 16 node hypercube.
In this experiment, a 128x128 uncompressible array was used as input.
To test the reconfiguration algorithm, the quadtree data compression
algorithm was written and executed on a BST embedded in a hypercube topology.
Three separate test were run to measure various attributes of the quadtree
algorithm. First, the overhead incurred by the reconfiguration step
was measured. Next, the overhead incurred by the various fault-tolerance
methods was measured. Finally, the percentage of fault detected achieved
by the fault tolerance methods was measured. As seen in the reconfiguration
involves several nodes. Any child, of the faulty node, must detect
the node, compute an alternate parent, , and send , a reconfiguration message.
The node in turn must receive the reconfiguration message, add to its list
of children, and send the reconfiguration message to. Finally, receives
the reconfiguration message, adds to its list of children. As the
program was executing, the node was forced to fail by the fault injection
tool. Three different overhead values are obtained since each of the
three nodes involved in the reconfiguration process performs a different
task. As seen by the values in Table , the overhead incurred in each
node is very small and does not vary with the size of the BST.
The next experiment was designed to determine the time overhead incurred
by the fault tolerance methods. In this example, two methods were
chosen. First, the ``I'm Alive'' is sent between the nodes. Second,
each variable in the program is duplicated using C++ classes to redefine
the basic data types. The results of these tests are summarized in Table.
As can be seen in Table the recomputation overhead decreases
as the hybercube dimension increases. This is because the data is split
into smaller partitions when more nodes are present. The two contributing
factors to the overhead incurred when tolerating a node fault are the reconfiguration
and recomputation times. The data from these experiments show that the reconfiguration
time remains constant and the recomputation time decreases as the size of
the hypercube increases, i.e., the proposed algorithm is scalable.
A fault tolerance method implemented on a hypercube topology has been
discussed in this paper. The technique is applicable to the set of
parallel algorithms that use the divide and conquer approach and can be
mapped onto a tree topology. The implemented software layer can be used
for real-time parallel tree based applications,
when the time constrains should by met despite errors in the hypercube.
|