Hybrid Message Logging Protocols for Fast Recovery

 

Sriram Rao, Lorenzo Alvisi, and Harrick M. Vin
Department of Computer Sciences, UT Austin.

 

Introduction

Fast recovery has received little attention in the context of message logging protocols, which have instead focused on minimizing their overhead during failure-free executions. As distributed computing becomes commonplace, and many more applications are faced with the current costs of high availability, there is a fresh need for recovery-based techniques that combine high performance during failure-free executions with fast recovery.

As an initial step towards the development of these new techniques, we have implemented a sender-based pessimistic protocol[5], a receiver-based pessimistic protocol[2] and a causal protocol[1] and studied their performance during recovery[4] 1 .

All of these protocols log (1) the content and (2) the order of receipt of each message delivered by each process during its execution. Processes synchronously log on stable storage 2 the order of receipt, encoded in tuples called determinants. Receiver-based pessimistic protocols store on stable storage also the content of each message. However, this is not necessary: it can be shown that if all determinants are available during recovery, then the content of any message can be regenerated. Sender-based pessimistic and causal protocols take advantage of this observation to reduce logging overhead significantly during failure-free executions by keeping the contents of messages in volatile memory at the corresponding senders.

Table1 shows the results of our experiments. 3 Recovery in sender-based protocols is 2% to 20% slower than in receiver-based protocols when f=1, where f is the maximum number of concurrent failures. As f increases, however, sender-based protocols become as much as 390% slower. The cause for this peformance degradation is the increase in the time necessary to acquire the content of messages that have to be replayed during recovery. In receiver-based protocols these messages can be retrieved from stable storage, independent of the value of f.

In sender-based protocols, messages are stored only in the senders' volatile memory and are lost if the sender fails. Consequently, whenever a recovering process needs to acquire one of these missing messages, it stops rolling forward, waiting for the sender to recover to the point at which the message is regenerated and resent. We call this effect stop-and-go. As f increases, more messages are temporarily lost, and stop-and-go has an increasingly adverse effect on the overall recovery time.


 
Table 1: The cost of recovery in receiver-based pessimistic and sender-based (causal) protocols. These experiments are with 4 processes, and failures are induced approximately 3 minutes after a checkpoint. The last column shows the percentage increase in recovery cost of sender-based protocols when compared to receiver-based protocols.

Receivr

based

Sender

based

(Causal)

Application f Restore
State
(sec.)
Acquire
Logs
(sec.)
Roll
Forward
(sec.)
Total
Recovery
Time
(sec.)
Restore
State
(sec.)
Acquire
Logs
(sec.)
Roll
Forward
(sec.)
Total
Recovery
Time
(sec.)
Percentage
Increase
                         
1 0.31 1.4 30.28 31.99 0.36 3.67 30.74 34.77 9%
grid 2 0.34 1.43 32.1 33.87 0.41 7.52 80.69 88.62 162%
3 0.39 1.5 32.29 34.18 0.43 6.1 127.63 134.16 293%
                         
1 0.29 3.5 78.01 81.8 0.35 4.42 78.1 82.87 2%
nbody 2 0.41 3.7 78.19 82.3 0.47 10.64 112.89 124.0 50%
3 0.42 4.1 78.3 82.82 0.49 7.25 129.71 137.45 66%
                         
1 2.61 11.15 207.13 220.89 2.56 15.7 210.7 228.96 4%
gauss 2 3.1 13.15 208.7 224.95 3.22 25.3 229.88 258.4 15%
3 3.7 14.7 209.4 227.8 3.78 21.7 235.3 260.78 16%
                         
1 0.31 0.95 41.1 42.36 0.36 2.55 40.97 43.88 4%
life 2 0.41 1.3 43.36 45.07 0.45 6.1 108.84 115.39 156%
3 0.46 1.5 43.8 45.76 0.5 4.25 151.09 155.84 240%
                         
1 1.85 0.8 22.29 24.94 1.87 5.91 21.39 29.17 17%
p2fox 2 2.1 1.23 22.7 26.03 2.2 10.39 67.92 80.51 209%
3 2.7 1.64 22.78 27.12 2.65 8.43 122.24 133.32 390%
                         



Hybrid Protocols

Our results indicate that existing logging protocols exhibit a tradeoff between fast failure-free execution and fast recovery. To address this tradeoff, we introduce a new class of hybrid protocols. The objective of hybrid protocols is to maintain the failure-free performance of sender-based protocols while approaching the performance of receiver-based protocols during recovery. In hybrid protocols, messages are logged both at the sender and at the receiver. The sender synchronously logs messages in its volatile memory; the receiver asynchronously logs messages to stable storage. Since logging on stable storage is asynchronous, performance during failure-free executions is virtually identical to that of sender-based protocols. However, recovery is much faster, since stop-and-go cannot occur when the recovering process rolls forward using the messages available from stable storage. Note that in hybrid protocols stable storage does not contain all messages needed during recovery (as it did in receiver-based protocols). Missing messages are either available in the volatile memory of other operational processes, or have to be regenerated if the sender has failed. In either case, a process can roll forward using the messages on stable storage while in parallel acquires the missing messages.

We have implemented and evaluated hybrid versions of sender-based pessimistic and causal protocols. We show the results for hybrid causal protocols in Table2. Our experiments show that hybrid logging dramatically reduces the recovery cost of sender-based protocols, which are now within 30% of receiver-based protocols, even when f>1. At the same time, Table3 shows that when hybrid logging is used in place of sender-based logging, application execution time increases by at most 1%.

In conclusion, our experiments show that to avoid the stop and-go effect, logging protocls should not rely on other processes to providethe messages that have to be replayed during recovery. Hybrid protocols approach this goal, without imposing a high overhead during failure-free executions.


   
Table 2: Recovery cost of hybrid protocols for the same experimental environment of Table 1. The last column shows the percentage increase in recovery cost of hybrid protocols when compared to receiver-based protocols.
Application f Restore
State
(sec.)
Acquire
Logs
(sec.)
Roll
Forward
(sec.)
Total
Recovery
Time
(sec.)
Percentage
Increase
             
1 0.3 1.2 30.58 32.08 0%
grid 2 0.32 1.23 37.22 38.77 14%
3 0.42 1.25 42.5 44.17 29%
             
1 0.32 2.9 77.75 80.97 0%
nbody 2 0.35 2.86 89.8 93.01 13%
3 0.38 2.92 95.3 98.6 19%
             
1 2.75 9.8 211.01 223.56 1%
gauss 2 3.2 9.83 216.58 229.61 2%
3 3.68 9.91 223.5 237.09 4%
             
1 0.33 0.78 41.5 42.61 0%
life 2 0.42 0.81 47.3 48.53 7%
3 0.45 0.83 53.2 54.48 19%
             
1 1.9 0.6 22.6 25.1 0%
p2fox 2 2.25 0.7 33.5 36.45 11%
3 2.73 0.75 42.7 46.18 30%
             


Table 3: Overhead of hybrid logging on failure-free execution when compared to sender-based logging
Application Additional Overhead
grid 0.03%
nbody 0.057%
gauss 0.76%
life 0.041%
p2fox 0.2%


References

1 L. Alvisi and K. Marzullo, Message logging: Pessimistic, optimistic, causal, and optimal.
IEEE Transactions on Software Engineering, 24(2):149--159, February 1998.

2 A. Borg, J. Baumbach, and S. Glazer, A message system supporting fault tolerance.
In Proceedings of the Symposium on Operating Systems Principles, pages 90-99. ACM SIGOPS, October 1983.

3 O. P. Damani and V. K. Garg, How to recover efficiently and asynchronously when optimism fails.
In Proceedings of the 16th International Conference on Distributed Computing Systems, pages 108-115, 1996.

4 S. Rao, L. Alvisi, and H. Vin, The cost of recovery in message logging protocols.
Technical Report 98-02, University of Texas at Austin, March 1998.

5 R. E. Strom, D. F. Bacon, and S. A. Yemini, Volatile logging in n-fault-tolerant distributed systems.
In Proceedings of the Eighteenth Annual International Symposium on Fault-Tolerant Computing, pages 44-49, 1988.

 
1 Our study includes also a receiver-based optimistic protocol[3]. For the sake of space, in this abstract we focus on pessimistic and causal protocols, which try to speed up recovery by preventing the rollback of correct processes. A full discussion of our experiments can be found in[4]
 
2 Pessimistic protocols implement stable storage using the file server's disk, while causal protocols use the volatile memory of at least f+1 processes, where f is the maximum number of concurrent failures.
 
3 Since the performance during recovery of causal and sender-based pessimistic protocols does not vary significantly, we only show the results for the causal protocol.