Fault Tolerant Communication Tools in Embedded Supercomputing
E. Verentziotis, A. Meliones, T. Varvarigou, G. Efthivoulidis,
A. Kontizas 1
National Technical University of Athens
Department of Electrical and Computer Engineering
Division of Computer Science
{dora, meliones, verentz}@telecom.ntua.gr
The very nature of a parallel
system, multiprocessor cooperation, seems to be both its powerful feature
and its fatal weakness. Communication between processors provides the means
to success for a parallel system, as long as it brings processors together
and accumulates their power. In distributed memory multiprocessor systems
(message passing) communication between nodes for the coherence of parallel
tasks can be either blocking (synchronous) or non-blocking (asynchronous).
In the synchronous mode problems arise when communication links or communicating
threads are in an erroneous state (broken links, threads in infinite loops
etc.) and thus communicating threads remain blocked, since communication
cannot be initiated or completed. On the other hand, in asynchronous communication,
problems arise when communicating threads are in an erroneous state or mailbox
mechanisms supporting asynchronous communication are malfunctioning. The
need for fault tolerant communication mechanisms is obvious; they are key
factors for a parallel system's dependability and therefore can unlock the
full system's capacity.
Within the ESPRIT project
EFTOS a framework is developed to integrate fault tolerance flexibly and
easily into distributed embedded HPC applications. This framework consists
of a variety of reusable fault tolerance modules acting at different levels.
The cost and performance overhead of generic Operating System and Hardware
level fault tolerance mechanisms are avoided, while at the same time the
burden of ad hoc fault tolerance programming is removed from the
application developers. The proposed framework employs a series of mechanisms
for communication error detection and recovery and operates at three levels:
At the lowest level,
it consists of tools for error detection (D-tools) and tools for error recovery
(R-tools). These are parameterisable functions that are dynamically started
during the execution of the application. When a D-tool detects an error,
it uses a standardized interface to pass specific information to the higher
level (DIRnet). The R-tools are started by the DIRnet and their role is
to recover the application after some error has occurred. These tools can
work either in combination with the higher levels or as stand-alone tools.
At the middle level,
the above tools are coordinated via a distributed control network, the Detection-Isolation-Recovery
network (DIRnet). The DIRnet is a hierarchical network that serves as a
backbone to pass information among the fault tolerance elements of the application
and allows distributed actions to be taken. It consists of a central manager
who has a global view of the system and several agents located in different
nodes, who interact with D/R-tools in their field and perform multiple coordinated
actions along the network, unfolding this way strategic schemes dictated
by the user. Agents are not interconnected, but can communicate via the
central manager, who is responsible for their cohesion.
At the highest level
, the above tools are combined into mechanisms that apply fault tolerance
to processing or communication modules. A custom language, specifying the
user's recovery strategy complements this.
A view of the FT framework
architecture is shown in Fig.1. The adaptation layer allows the generic
definition of the fault tolerance library interface both to the underlying
operating system and target hardware. The FT library has been implemented
on top of EPX, TEX and Win/NT platforms. This guarantees portability to
different architectures where these platforms are available.
Fig. 1. Fault
Tolerant Framework Architecture
Two methods are provided
to avoid problematic situations in the synchronous message-passing mode:
- The status of the communication link and
the communication partner are explicitly tested before blocking for communication.
- Communication is established normally but
timeout mechanisms are initiated to escape from faulty situations. Naturally,
both approaches can be used in combination.
Every time two threads
need to establish a communication channel a special thread is created by
each of them, the Channel Control Thread (CCT), which will control
the fault tolerant communication. The possibility of returning from blocked
communication is provided by using this separate thread. The CCT is used
to handle timeouts and to trigger recovery actions in co-operation with
the DIRnet (see Fig. 2). It is also responsible for handling isolation actions
(temporary or permanent deactivation of a communication channel) and recovery
actions (reactivation of a channel, creation of an alternative channel,
re-sending of a message). The existence and the actions of the CCT are transparent
to the application and are initiated only if a communication channel is
defined to be fault-tolerant.
Fig. 2. Fault
Tolerant Synchronous Communication with Dual CCTs
The two CCTs are synchronized
by a SYNC control signal, which is sent by the sending to the receiving
side within a timeout period (see Fig. 3). After the synchronization, a
READY signal is sent to the application threads by the CCTs and the application
threads are ready to exchange data through the CCTs. After the exchange
of data between the application threads, a second synchronization takes
place between the sender and receiver CCTs in the same way as before.
In the case of asynchronous
communication, communication problems are dealt with as follows: the sender,
upon sending its message, evokes a task (initiated at start-up) or creates
a thread, whose job is to monitor the status of the sender's message and
to inform accordingly the DIRnet and/or the sender. When the receiver retrieves
the mail from the mailbox, it notifies the monitoring task that the mail
was correctly (or not) received. The monitoring task waits for either such
a message from the receiver or a timeout. If this message arrives before
the timeout, it sends an appropriate message to the sender, if the sender
is in suspending mode. If timeout occurs, then the monitoring task concludes
that something is wrong with the receiver and informs the DIRnet about the
trouble. The DIRnet can then take some recovery actions. In case of a stand-alone
mechanism the recovery phase is built-in and is automatically initiated.
The result of these actions (successful or not) is then propagated via the
monitoring task to the sender. This mechanism corresponds to a single CCT
for control messages version of the mechanism that was devised for the synchronous
communication case.
Fig. 3. Protocol
used for Synchronous Communication on Fault Tolerant Links (Dual CCTs)
The presented communication
fault tolerance functionality has been successfully integrated in two real-time
embedded HPC applications: an image processing module in an automatic mail
processing system developed by Siemens ElektroCom, and a remotely controlled
automation system for electric high-voltage substations operated by ENEL
(Italian Electricity Provider). This resulted in valuable experience when
software fault injection was applied. Future work will move further on the
porting of the FT framework across different platforms and operating systems,
providing and integrating standard mechanisms for node to node interoperability.
Furthermore, the implementation of a fault tolerance middleware using emerging
standards, technologies and industrial initiatives (e.g. CORBA) will be
considered to guarantee the required level of dependability of object-oriented
open distributed systems.
References
[Ver 98] Verentziotis,
E., Meliones, A., Varvarigou, T., Efthivoulidis, G., Kontizas, A.: "Synchronous
and Asynchronous Fault Tolerant Communication in Embedded Supercomputing".
Int. Conf. on Parallel and Distributed Processing Techniques and Applications
(PDPTA'98). Las Vegas, July 13-16, 1998.
|