Fault Tolerant Communication Tools in Embedded Supercomputing

E. Verentziotis, A. Meliones, T. Varvarigou, G. Efthivoulidis, A. Kontizas 1
National Technical University of Athens
Department of Electrical and Computer Engineering
Division of Computer Science
{dora, meliones, verentz}@telecom.ntua.gr

The very nature of a parallel system, multiprocessor cooperation, seems to be both its powerful feature and its fatal weakness. Communication between processors provides the means to success for a parallel system, as long as it brings processors together and accumulates their power. In distributed memory multiprocessor systems (message passing) communication between nodes for the coherence of parallel tasks can be either blocking (synchronous) or non-blocking (asynchronous). In the synchronous mode problems arise when communication links or communicating threads are in an erroneous state (broken links, threads in infinite loops etc.) and thus communicating threads remain blocked, since communication cannot be initiated or completed. On the other hand, in asynchronous communication, problems arise when communicating threads are in an erroneous state or mailbox mechanisms supporting asynchronous communication are malfunctioning. The need for fault tolerant communication mechanisms is obvious; they are key factors for a parallel system's dependability and therefore can unlock the full system's capacity.

Within the ESPRIT project EFTOS a framework is developed to integrate fault tolerance flexibly and easily into distributed embedded HPC applications. This framework consists of a variety of reusable fault tolerance modules acting at different levels. The cost and performance overhead of generic Operating System and Hardware level fault tolerance mechanisms are avoided, while at the same time the burden of ad hoc fault tolerance programming is removed from the application developers. The proposed framework employs a series of mechanisms for communication error detection and recovery and operates at three levels:

At the lowest level, it consists of tools for error detection (D-tools) and tools for error recovery (R-tools). These are parameterisable functions that are dynamically started during the execution of the application. When a D-tool detects an error, it uses a standardized interface to pass specific information to the higher level (DIRnet). The R-tools are started by the DIRnet and their role is to recover the application after some error has occurred. These tools can work either in combination with the higher levels or as stand-alone tools.

At the middle level, the above tools are coordinated via a distributed control network, the Detection-Isolation-Recovery network (DIRnet). The DIRnet is a hierarchical network that serves as a backbone to pass information among the fault tolerance elements of the application and allows distributed actions to be taken. It consists of a central manager who has a global view of the system and several agents located in different nodes, who interact with D/R-tools in their field and perform multiple coordinated actions along the network, unfolding this way strategic schemes dictated by the user. Agents are not interconnected, but can communicate via the central manager, who is responsible for their cohesion.

At the highest level , the above tools are combined into mechanisms that apply fault tolerance to processing or communication modules. A custom language, specifying the user's recovery strategy complements this.

A view of the FT framework architecture is shown in Fig.1. The adaptation layer allows the generic definition of the fault tolerance library interface both to the underlying operating system and target hardware. The FT library has been implemented on top of EPX, TEX and Win/NT platforms. This guarantees portability to different architectures where these platforms are available.

Fig. 1. Fault Tolerant Framework Architecture

Two methods are provided to avoid problematic situations in the synchronous message-passing mode:

  • The status of the communication link and the communication partner are explicitly tested before blocking for communication.
  • Communication is established normally but timeout mechanisms are initiated to escape from faulty situations. Naturally, both approaches can be used in combination.

Every time two threads need to establish a communication channel a special thread is created by each of them, the Channel Control Thread (CCT), which will control the fault tolerant communication. The possibility of returning from blocked communication is provided by using this separate thread. The CCT is used to handle timeouts and to trigger recovery actions in co-operation with the DIRnet (see Fig. 2). It is also responsible for handling isolation actions (temporary or permanent deactivation of a communication channel) and recovery actions (reactivation of a channel, creation of an alternative channel, re-sending of a message). The existence and the actions of the CCT are transparent to the application and are initiated only if a communication channel is defined to be fault-tolerant.

Fig. 2. Fault Tolerant Synchronous Communication with Dual CCTs

The two CCTs are synchronized by a SYNC control signal, which is sent by the sending to the receiving side within a timeout period (see Fig. 3). After the synchronization, a READY signal is sent to the application threads by the CCTs and the application threads are ready to exchange data through the CCTs. After the exchange of data between the application threads, a second synchronization takes place between the sender and receiver CCTs in the same way as before.

In the case of asynchronous communication, communication problems are dealt with as follows: the sender, upon sending its message, evokes a task (initiated at start-up) or creates a thread, whose job is to monitor the status of the sender's message and to inform accordingly the DIRnet and/or the sender. When the receiver retrieves the mail from the mailbox, it notifies the monitoring task that the mail was correctly (or not) received. The monitoring task waits for either such a message from the receiver or a timeout. If this message arrives before the timeout, it sends an appropriate message to the sender, if the sender is in suspending mode. If timeout occurs, then the monitoring task concludes that something is wrong with the receiver and informs the DIRnet about the trouble. The DIRnet can then take some recovery actions. In case of a stand-alone mechanism the recovery phase is built-in and is automatically initiated. The result of these actions (successful or not) is then propagated via the monitoring task to the sender. This mechanism corresponds to a single CCT for control messages version of the mechanism that was devised for the synchronous communication case.

Fig. 3. Protocol used for Synchronous Communication on Fault Tolerant Links (Dual CCTs)

The presented communication fault tolerance functionality has been successfully integrated in two real-time embedded HPC applications: an image processing module in an automatic mail processing system developed by Siemens ElektroCom, and a remotely controlled automation system for electric high-voltage substations operated by ENEL (Italian Electricity Provider). This resulted in valuable experience when software fault injection was applied. Future work will move further on the porting of the FT framework across different platforms and operating systems, providing and integrating standard mechanisms for node to node interoperability. Furthermore, the implementation of a fault tolerance middleware using emerging standards, technologies and industrial initiatives (e.g. CORBA) will be considered to guarantee the required level of dependability of object-oriented open distributed systems.

References

[Ver 98] Verentziotis, E., Meliones, A., Varvarigou, T., Efthivoulidis, G., Kontizas, A.: "Synchronous and Asynchronous Fault Tolerant Communication in Embedded Supercomputing". Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'98). Las Vegas, July 13-16, 1998.


1. Authors contact: National Technical University of Athens, Department of Electrical and Computer Engineering, Division of Computer Science, Zografou Campus, 9 Iroon Politechniou Str., Athens 15773, Greece, Phone: +30-1-7721478. Fax: +30-1-7722534. E-Mail: {dora, meliones, verentz}@telecom.ntua.gr