FTCS-27
The Twenty-Seventh Annual International Symposium
on Fault-Tolerant Computing
June 24 -27, 1997, Seattle, Washington USA
OPENING SESSION
1. Opening Remarks
2. Carter Award Presentation - R. Iyer and J. C. Laprie
3. Keynote Address
SESSION 1A, DISTRIBUTED SYSTEMS 10:30-12:00
1. Managing Dependencies - A Key Problem in Fault-Tolerant Distributed
Algorithms
P. Theisohn, E. Nett, and M. Mock, German National Research Center
for Information Technology, St. Augustin, Germany
2. An Approach to Fault-Tolerant Parallel Processing on
Intermittently Idle Heterogeneous Workstations
K. Jeong, University of Florida, USA, D. Shasha, S. Talla, and
P. Wyckoff, Courant Institute of Mathematical Sciences, New York
University, USA
3. Renegotiable Quality of Service - A New Scheme for Fault Tolerance
in Wireless Networks
T.-W. Chen, P. Krzyzanowski, M. R. Lyu, C. Sreenan, and J. Trotter,
Lucent Bell Laboratories, USA
SESSION 1B, DEPENDABILITY EVALUATION 10:30-12:00
1. VERIFY: Evaluation of Reliability Using VHDL-Models with Embedded
Fault Descriptions
V. Sieh, O. Tschäche, and F. Balbach, Universität Erlangen-Nürnberg,
Germany
2. Toward Accessibility Enhancement of Dependability Modeling
Techniques
A. T. Tai, IA Tech, Inc., H. Hecht and B. Zhang, SoHar Inc., USA,
and K. S. Trivedi, Duke University, USA
3. Evaluation of a 32-bit Microprocessor with Built-in Concurrent
Error-Detection
J. Gaisler, European Space Research and Technology Centre, The Netherlands
SESSION 2A, CHECKPOINTING AND RECOVERY 1:30-3:00
1. Probabilistic Checkpointing
H.-C. Nam, J. Kim, S. Hong, and S. Lee, Pohang University of Science
and Technology, Pohang, Korea
2. Portable Checkpointing for Heterogeneous Architectures
B. Ramkumar, University of Iowa, USA and V. Strumpen, Massachusetts
Institute of Technology, USA
3. A Communication-Induced Checkpointing Protocol that Ensures
Rollback Dependency Trackability
R. Baldoni, University of Rome, Italy, J.-M. Hélary, A. Mostefaoui,
and M. Raynal, IRISA, Rennes Cedex, France
SESSION 2B, USER INTERFACES AND OBJECT-ORIENTED TESTING 1:30-3:00
1. A Methodology to Automate User Interface Testing Using Variable
Finite State Machines
R. Shehady and D. P. Siewiorek, Carnegie Mellon University,
Pittsburgh, USA
2. MetriStation: A Tool for User-Interface Fault Detection
R. A. Maxion and P. A. Syme, Carnegie Mellon University, Pittsburgh,
USA
3. Towards a Statistical Approach to Testing Object-Oriented Programs
P. Thevenod-Fosse and H. Waeselynck, LAAS, Toulouse Cedex, France
SESSION 3A, WORK IN PROGRESS AND OUTRAGEOUS OPINIONS 3:30-5:00
Open Session Organizer: R. D. Schlichting, University of Arizona
Those interested in attending this session are encouraged to contact
Professor Schlichting for further details at: rick@cs.arizona.edu
SESSION 3B, REAL-TIME SYSTEMS 3:30-5:00
1. How Hard is Hard Real-time Communication on Field-Buses?
P. Verissimo, University of Lisboa, Portugal, J. Rufino, Technical
University of Lisboa, and L. Ming, Ecole Polytechnique de Lausanne,
Switzerland
2. Experimental Evaluation of Failure-Detection Schemes in Real-time
Communication Networks
S. Han and K. G. Shin, University of Michigan, Ann Arbor, USA
3. Fault-Tolerance in Real-time Control Applications Using (m, k)-Firm
Guarantee
P. Ramanathan, University of Wisconsin, Madison, USA
Thursday, June 26th, 1997
8:00 - 9:30 AM SESSIONS 4A & 4B
9:30 - 10:00 - Coffee
10:00 - 11:30 - SESSIONS 5A & 5B
11:30 - 12:15 - SESSION 6
12:15 - 1:30 PM Lunch
1:30 - 10:00 - Visits/Tours/Banquet
SESSION 4A, TEST GENERATION 8:00-9:30
1. ACTIV-LOCSTEP: A Test Generation Procedure Based on Logic
Simulation and Fault Activation
I. Pomeranz and S. M. Reddy, University of Iowa, Iowa City, USA
2. Robust Search Algorithms for Test Pattern Generation
J. P. M. Silva, INESC, Lisboa, Portugal and K. A. Sakallah,
University of Michigan, Ann Arbor, USA
3. Optimal Structural Diagnosis of Wiring Networks
W. Shi, University of North Texas, Denton, USA and D. B. West, University
of Illinois at Urbana-Champaign, USA
SESSION 4B, SCHEDULING AND RECOVERY 8:00-9:30
1. A New Approach to Realizing Fault-Tolerant Multiprocessor
Scheduling by Exploiting Implicit Redundancy
K. Hashimoto, T. Tsuchiya, and T. Kikuno, Osaka University, Japan
2. Fault Recovery Mechanism for Multiprocessor Servers
Y. Masubuchi, S. Hoshina, T. Shimada, H. Hirayama, and N. Kato,
TOSHIBA Corporation, Tokyo, Japan
3. An Object-Oriented Testbed for the Evaluation of Checkpointing
and Recovery Systems
B. Ramamurthy and S. J. Upadhyaya, State University of New York,
Buffalo, USA and R. K. Iyer, University of Illinois at
Urbana-Champaign, USA
SESSION 5A, EXPLOITING REDUNDANCY 10:00-11:30
1. Using Non-Volatile Storage to Improve the Reliability of RAID5
Disk Arrays
R. Y. Hou and Y. N. Patt, University of Michigan, Ann Arbor, USA
2. Using Virtual Links for Reliable Information Retrieval Across
Point-to-Point Networks
F. J. Meyer, X.-T. Chen, W.-K. Huang, and F. Lombardi, Texas A&M
University, College Station, USA
3. On the Design of Constant Weight Codes for VLSI Systems
L. G. Tallini and B. Bose, Oregon State University, Corvallis, USA
SESSION 5B, FAULT INJECTION 10:00-11:30
1. On Finding an Optimal Combination of Error Detection Mechanisms
Based on Results of Fault Injection Experiments
A. Steininger and C. Scherrer, Vienna University of Technology,
Austria
2. Dependability Analysis of a High-Speed Network
M.-C. Hsueh, D. Stott, G. Ries, and R. K. Iyer, University of
Illinois at Urbana-Champaign, USA
3. Fault-Injection-Based Testing of Fault-Tolerant Algorithms in
Message-Passing Parallel Computers
D. M. Blough and T. Torii, University of California at Irvine, USA
SESSION 6, TOOL DEMONSTRATIONS 11:30-12:15
Model- and Experiment-based Evaluation: Part 1
Organizer: W. H. Sanders
University of Illinois at Urbana-Champaign, USA
Friday, June 27th, 1997
8:00 - 9:30 AM SESSIONS 7A & 7B
9:30 - 10:00 - Coffee
10:00 - 11:30 - SESSIONS 8A & 8B
11:30 - 12:15 - SESSION 9
12:15 - 1:30 PM Lunch
1:30 - 3:00 - SESSIONS 10A & 10B
3:00 - 4:00 - TC Meeting
SESSION 7A, ROBUST ALGORITHMS 8:00-9:30
1. Robust Emulation of Shared Memory Using Dynamic
Quorum-Acknowledged Broadcasts
N. Lynch and A. Shvartsman, Massachusetts Institute of Technology,
Cambridge, USA
2. Fail-Awareness: An Approach to Construct Fail-Safe Applications
C. Fetzer and F. Cristian, University of California at San Diego,
USA
3. A Lightweight Solution to Uniform Atomic Broadcast for Asynchronous
Systems
E. Anceaume, IRISA/CNRS, Rennes Cedex, France
SESSION 7B, EXPERIENCE WITH DISTRIBUTED SYSTEMS 8:00-9:30
1. Integrating Checkpointing with Transaction Processing
Y. M. Wang, AT&T Labs, Murray Hill, USA, P. Y. Chung and Y. Huang,
Lucent Bell Labs, Murray Hill, USA, and E. N. Elnozahy, Carnegie
Mellon University, Pittsburgh, USA
2. Implementing the Swiss Electronic Stock Exchange System
R. Piantoni and C. Stanescu, Swiss Stock Exchange, Zürich,
Switzerland
3. A Flexible Clustered Approach to High Availability
G. Hughes-Fenchel, Lucent Technologies, Naperville, USA
SESSION 8A, DESIGN FOR TESTING AND RELIABILITY 10:00-11:30
1. Partial Scan Beyond Cycle Cutting
G. S. Saund, M. S. Hsiao, and J. H. Patel, University of Illinois,
Urbana, USA
2. Microarchitectural Synthesis of ICs with Embedded Concurrent
Fault Isolation
S. N. Hamilton and A. Orailoglu, University of California at San
Diego, USA
3. COFTA: Hardware-Software Co-Synthesis of Heterogeneous Distributed
Embedded System Architectures for Low Overhead Fault Tolerance
B. P. Dave and N. K. Jha, Princeton University, Princeton, USA
SESSION 8B, YEAR 2000 CHALLENGE 10:00-11:30
Panel Organizer: R. Chillarege, IBM, T. J. Watson Research Center,
Yorktown Heights, USA
SESSION 9, TOOL DEMONSTRATION 11:30-12:15
Model- and Experiment-based Evaluation: Part 2
Organizer: H. Madeira, Universidade de Coimbra, Portugal
SESSION 10A, MODELING AND PREDICTION 1:30-3:00
1. Discriminating Fault Rate and Persistency to Improve Fault
Treatment
A. Bondavalli, CNUCE/CNR, Pisa, S. Chiaradonna, F. Di Giandomenico,
and F. Grandoni, IEI/CNR, Pisa, Italy
2. A General Model for Reliability Maximization Problem Under Given
Redundancy
S. Morinaga, NEC Corporation, Kawasaki, Japan
3. Predicting Physical Processes in the Presence of Faulty Sensor
Readings
M. Clegg and K. Marzullo, University of California at San Diego,
USA
SESSION 10B, EXPERIENCE WITH LARGE-SCALE SYSTEMS 1:30-3:00
1. Experimental Evaluation of Computer-Based Railway Control Systems
A. M. Amendola, L. Impagliazzo, P. Marmo, and F. Poli, Ansaldo-Cris,
Napoli, Italy
2. Reliability-Oriented Design of a Distributed Control System for
High-Voltage Switchgear Stations
S. Draber, ABB Corporate Research, Baden, Switzerland
3. Redundancy Management Software Services for Seawolf Ship Control
System
J. T. Sims, Draper Laboratory, Cambridge, USA
Tutorials
1. Fault Tolerance in Wolfpack NT Clusters
Joe Barrera, Jim Gray, Microsoft
2. Availability in the Real World
Alan Wood, Tandem Computers
3. Commercial Fault Tolerant Computing as Currently Practiced by Stratus
Rick Harper, Stratus
4. Building Secure and Reliable Network Applications
Kenneth Birman, Cornell University
Tutorial 1: 8:00 AM - 12:00 Noon
A WolfPack Tutorial: Commoditizing High Availability
Joe Barrera, Jim Gray, and the NT WolfPack Clusters team.
A consortium of sixty hardware and software vendors has defined an
application programming interface and architecture to commoditize
highly available computing. Inspired by Greg Pfister's book, In Search
of Clusters, the project is code-named Wolfpack. Later versions of
Wolfpack address cluster scalability.
This tutorial will outline the Wolfpack goals. It will then explain
the architecture and architectural concepts, both abstractly and with
demonstrations using a "portable" fault tolerant cluster. Wolfpack can
improve the availability of any application by giving it automatic
restart and process mobility (improved MTTR). The tutorial will explore
how applications can go beyond this simple benefit: making it easier to
install and manage them. The tutorial explains how file servers,
web-servers, and database servers become Wolfpack aware. These concepts
are demonstrated on a working cluster. The tutorial concludes with a
discussion of some key algorithms underlying the cluster: quorum,
global update, time, and IP-address failover.
Biography
Dr. Gray is a specialist in database and transaction processing
computer systems. At Microsoft his research focuses on scalable
computing: building super-servers and workgroup systems from commodity
software and hardware. He is editor of the Performance Handbook for
Database and Transaction Processing Systems, and co-author of
Transaction Processing Concepts and Techniques. He holds doctorates
from Berkeley and Stuttgart, is a Member of the National Academy of
Engineering, Fellow of the ACM, member of the National Research
Council's Computer Science and Telecommunications Board, Editor in
Chief of the VLDB Journal, Trustee of the VLDB Foundation, and Editor
of the Morgan Kaufmann series on Data Management.
URL: http://research.microsoft.com/~gray
URL: http://research.microsoft.com/~joebar/
Tutorial 2: 8:00 AM - 12:00 Noon
Availability in the Real World
Alan Wood, Tandem Computers, Inc.
Classical models of computer system availability generally consist of
detailed hardware failure models. Unfortunately, these classical
availability models are no longer relevant since software, operations,
and environmental failures are all at least as important as hardware
failures. Availability models are increasingly complex because they
must account for the myriad of computer system configurations and the
many failure modes of client, server, and network devices.
In this tutorial, we will briefly survey classical definitions and
techniques such as Markov models. We will introduce outage
categorizations, new definitions of failure, and new metrics such as
user outage minutes. Methods for minimizing outages in each of the
outage categories will be discussed. A large sample of real-world
outage data will be shown to demonstrate the wide variety of failures
that can occur in a complex computing environment. A simple spreadsheet
model that incorporates most of the failure modes gleaned from the data
will be presented. We will describe how we used this model to help
drive availability improvement in Tandem systems.
Biography
Dr. Alan Wood is a Senior Engineer in the Reliability Engineering
Department at Tandem Computers, Inc. He has 20 years of experience in
reliability engineering for nuclear energy, defense, and commercial
environments. He has a Ph.D. in Operations Research from Stanford
University and has published over 25 papers in the areas of reliability
and availability. His current primary research interests are software
reliability and computer system availability.
Tutorial 3: 1:00 PM - 5:00 PM
Commercial fault tolerant computing as currently practiced by Stratus
and its customers
Coordinated by Rick Harper, Stratus Computer.
This tutorial is intended to provide a complete picture of the
challenges involved in successfully developing and operating a fault
tolerant computer for a mission-critical commercial application. The
tutorial will provide a detailed technical description of the hardware
architecture of a current Stratus system. Various technical issues
including design verification and testing, price/performance, error
recovery, and component reintegration will be discussed. A technical
description of Stratus' VOS operating system will also be presented,
including features specific to supporting hardware fault tolerance,
internal robustness and recovery features, and online upgrade issues.
The tutorial will also describe techniques that can be used to improve
the robustness of operating system and application software. Topics to
be covered include the effects of hardware and software faults and
errors, software techniques for tolerating errors, and techniques for
developing reliable code.
Biography
Dr. Rick Harper is a Senior Technical Consultant at Stratus Computer.
Prior to arriving at Stratus in 1995, he spent 12 years at The Charles
Stark Draper Laboratory, where he worked in the design, development,
and implementation of fault tolerant computing, communication, and
software systems. He has authored numerous technical papers, holds
several patents, and has supervised over 20 advanced degrees in the
area of fault tolerant computing. He obtained his Ph.D. in the area of
fault tolerant computing from MIT in 1987.
Tutorial 4: 1:00 PM - 5:00 PM
Building Secure and Reliable Network Applications
Professor Kenneth P. Birman, Cornell University
This tutorial will look at security and reliability limitations of
modern network technologies, including client-server architectures, the
Web, CORBA, DCOM/ActiveX, and Java. We will review the major security
and reliability options available to developers of critical systems,
including transactions, process group computing, and key-based
security. Finally, we will discuss some of the emerging technology
prospects for critical applications, such as virtual private networks
and IPv6 quality of service guarantees. Because of the short length of
the tutorial, the treatment will be fairly fast-paced, but not overly
detailed.
The tutorial draws on material from a recent book, copies of which
can be ordered from Prentice Hall, or ordered online from
http://www.browsebooks.com/Birman/index.html.
Biography
Professor Birman heads the Horus and Ensemble research efforts at
Cornell University. He is also Editor in Chief of ACM Transactions on
Computer Systems, and author of many publications in this area.
Professor Birman was founder of Isis Distributed Systems, Inc., which
is now a division of Stratus Computer. He has played a role in
developing software for several stock markets, an air-traffic control
system, VLSI chip-fabrication process control systems, military systems
(including technology for use in future naval AEGIS systems) and other
critical applications.