Delphos: Dependability
Evaluation of COTS DBMS
Diamantino Costa, Henrique Madeira, João Gabriel Silva
Universidade de Coimbra Dep. Eng. Informática
Portugal
Delphos is an ongoing research project that aims
to study and evaluate the fault-tolerance techniques available in commercial
off-the-shelf (COTS) Database Management Systems (DBMS) through fault injection [1].
DBMS have traditionally had strong fault-tolerance needs: data
integrity and high availability are cornerstone requirements, and
the database community has driven research advances in several fault-tolerance
techniques (transactions, two-phase commit protocols, data replication).
Still, it is generally accepted that the evaluation and validation of those techniques
has lacked the knowledge of failure modes that experimental fault injection
would provide [2]. The growing use of COTS DBMS technology
in mission-critical systems further increases
the interest in dependability evaluation of those systems. Indeed, those
interested in providing reliability wrappers will need to gather deep knowledge
of how COTS DBMS behave under faults. Furthermore, database and OS vendors have
been delivering cluster and recovery software add-on components that have
introduced new dimensions to benchmarking. As stated in [3], "How useful is the fact that a system can perform
a zillion tpm or execute a query in milliseconds without some knowledge
about its availability and the cost of achieving the appropriate level of
performance and availability?"
The goals of Delphos are both to establish a methodology for dependability
benchmarking of COTS DBMS and to gain insight into the impact of
faults on database applications. The latter will provide a basis for devising
defensive programming methodologies that take cost-effective advantage of
the fault-tolerance techniques built into DBMS, such as transactions
and checkpoints. The former goal is intended as a starting initiative towards
standard TPC-like (Transaction Processing Performance Council)
benchmarks that evaluate the data integrity and availability aspects of COTS
DBMS. We are aware that the portability of fault-injection tools is a major
hurdle for fault-injection-based evaluation, so we do not claim
that a fault-injection tool like the one present in this framework will
be the solution for supporting TPC-like dependability benchmarks. However,
we believe that a comprehensive fault-injection study based on field data
about faults is perhaps the best approach to selecting a high-level fault/error model
suitable for use in those benchmarks.
The following sections elaborate on the main design decisions and tradeoffs
made in conceiving the Delphos experimental framework and, where appropriate,
also describe the current prototype configuration (see Figure
1). The presentation is broken down into four dimensions: Target
System, Faults, Benchmarks
and Results.
- Target System The target system should be representative
of COTS DBMS, so it was decided early on to adhere strictly to commercial products.
To the best of our knowledge, no other experimental fault-injection study
has yet addressed a COTS DBMS. In our current setup the database server
is Oracle 7.3 Enterprise Edition running on top of Microsoft Windows NT
4.0 Server. The server hardware platform is an Intel 200 MHz Pentium Pro
based PC with 128 MB of RAM. The system architecture is a typical client/server
configuration with SQL*Net over TCP/IP middleware. Client load is driven
by a TPC-compliant Remote Terminal Emulator (RTE) running on a 150 MHz
Pentium PC with Windows NT Workstation 4.0.
- Faults Several studies have pointed
out the increasing responsibility of software for DBMS failures, particularly
on fault-tolerant machines [7]. However, in the presence of COTS hardware,
hardware faults are expected to still account for a significant share of system
failures. The fault set will therefore include both transient hardware faults
and software faults (i.e., bugs). Faults are injected only into the database processes/threads
(no attempt is made to mimic client or network faults). Although
it is fairly easy to inject faults into middleware and client components,
we are more concerned with server faults since they represent the biggest
impairment to system dependability [4]. The fault-injection
tool is XceptionNT [5], a SWIFI (software-implemented fault injection) tool that takes
advantage of the Pentium processor's debugging support to inject faults
with minimal intrusion. The first batch of experiments will focus on hardware
faults only, but the adaptation of XceptionNT to inject software faults,
as suggested in [6] and [7], is underway.
- Benchmarks Among DBMS
applications, OLTP clearly stands out as the most demanding in terms of dependability;
since TPC-C is a de facto standard for performance evaluation of OLTP systems,
it was a natural choice. The standard TPC-C setup has been augmented
with a new entity, the Database Administrator, and several
other enhancements were made to cope with the presence of new inputs,
namely faults and failures. The proposed marriage of a performance benchmark
with a fault-injection experiment enables straightforward evaluation
of the tradeoffs between performance and dependability, which is just what most
COTS DBMS purchasers would like to have. For the sake of simplicity, we
have selected the older TPC-A benchmark [8] for the first testbed
prototype (a TPC-C implementation is underway and will constitute the benchmark
of the final prototype).
- Results We are naturally interested
in obtaining figures on the impact of faults in terms of data integrity
and availability. Data integrity is addressed at three levels: 1) at the application
level, using the set of semantic rules and consistency tests already specified
for TPC benchmarks; 2) at the referential integrity level, matching the relational
database rules (data dictionary) against the data actually stored; 3) at the file
level, where database data files and log files are checked for the integrity
of their internal data structures. Availability is assessed through 1) the database
mean time to recovery as seen by the end user, 2) the type of database
recovery needed (partial or total) and 3) the recovery effort, either automatic or
database administrator assisted (manual). These detailed results (and
many others) are made possible by collecting data at different levels,
from the machine level to the OS, the DBMS and the application level.
The evaluation environment offers machine-level readouts, including the precise
location of the injected fault. At the OS level, exceptions, crash and
hang conditions and return codes of database processes are also logged.
At the application level, TPC-A activity is logged in order to pinpoint
delayed and lost transactions.
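As an illustration of how an application-level log could be used, the following minimal Python sketch classifies logged transactions as on time, delayed or lost, and estimates the recovery time seen by the end user. The record layout, the function names and the 2-second response-time threshold are assumptions made for this example; they are not taken from the Delphos tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TxRecord:
    """One client-side log entry (hypothetical layout)."""
    tx_id: int
    submitted_at: float            # seconds on the client clock
    committed_at: Optional[float]  # None if the client never saw a commit


def classify(records, deadline=2.0):
    """Split transactions into on-time, delayed and lost sets.

    `deadline` is an assumed response-time limit in seconds.
    """
    result = {"on_time": [], "delayed": [], "lost": []}
    for r in records:
        if r.committed_at is None:
            result["lost"].append(r.tx_id)
        elif r.committed_at - r.submitted_at > deadline:
            result["delayed"].append(r.tx_id)
        else:
            result["on_time"].append(r.tx_id)
    return result


def time_to_recovery(records, fault_time):
    """Approximate end-user recovery time.

    Assumption: service is considered restored at the first commit of a
    transaction that was submitted after the fault was injected.
    Returns None if no such commit was observed.
    """
    commits = [
        r.committed_at
        for r in records
        if r.committed_at is not None and r.submitted_at >= fault_time
    ]
    return min(commits) - fault_time if commits else None


log = [
    TxRecord(1, 0.0, 0.5),    # completes before the fault
    TxRecord(2, 1.0, 4.5),    # commits, but late
    TxRecord(3, 2.0, None),   # never commits
    TxRecord(4, 10.0, 11.0),  # first commit after recovery
    TxRecord(5, 11.0, 11.4),
]
summary = classify(log)
recovery = time_to_recovery(log, fault_time=1.5)
```

A real experiment would also have to correlate this client-side view with the OS-level and machine-level logs, since a transaction that appears lost to the client may still have been committed by the server.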
Figure 1 - Testbed Layout
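The transient hardware faults mentioned under Faults are commonly emulated by SWIFI tools as bit flips. The Python sketch below flips one randomly chosen bit in a simulated memory array and returns the fault location so that it can be logged. It is only an illustration: the 32-bit word width, the function names and the single-bit-flip model are assumptions, and the code does not reproduce how XceptionNT uses the Pentium debugging support.

```python
import random

WORD_BITS = 32                      # assumed word width
WORD_MASK = (1 << WORD_BITS) - 1


def flip_bit(word, bit):
    """Return `word` with the given bit inverted (single bit-flip model)."""
    if not 0 <= bit < WORD_BITS:
        raise ValueError("bit index out of range")
    return (word ^ (1 << bit)) & WORD_MASK


def inject_transient(memory, rng):
    """Flip one random bit in one random word of `memory` (a list of ints).

    Returns (address, bit) so the exact fault location can be recorded
    alongside the outcome of the experiment.
    """
    addr = rng.randrange(len(memory))
    bit = rng.randrange(WORD_BITS)
    memory[addr] = flip_bit(memory[addr], bit)
    return addr, bit


memory = [0] * 4                    # simulated target state
rng = random.Random(1)              # fixed seed makes the run repeatable
addr, bit = inject_transient(memory, rng)
```

Seeding the random generator is a deliberate choice in this sketch: it lets a given fault be replayed when an interesting failure mode needs to be examined again.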
We have briefly introduced Delphos, a research project for the experimental
evaluation of the dependability of COTS DBMS. The experimental framework
combines a minimally intrusive fault-injection tool with standard benchmarks
and a COTS OS and platform to deliver results that are as realistic as possible.
The first prototype, which consists of Oracle 7.3 Enterprise Server on top
of a Wintel platform, is now entering testing, and we expect to have preliminary
results in the near future.
References
[1] J. Arlat, "Fault injection for
the experimental validation of fault-tolerant systems," IEICE Workshop
on Fault-Tolerant Systems, Kyoto, Japan, 18-19 June 1992, pp. 33-40.
[2] M. T. Ozsu and P. Valduriez, "Distributed
Database Systems: Where are we now?," IEEE Computer, vol. 24, 1991.
[3] D. Brock, "A Recommendation for
High-Availability Options in TPC Benchmarks," Data General, http://www.tpc.org/articles/HA.html
[4] A. Wood, "Predicting Client/Server
Availability," IEEE Computer, vol. 28, pp. 41-48, 1995.
[5] J. Carreira, H. Madeira, J. Silva, "Xception:
A technique for the evaluation of dependability in modern Computers,"
IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 125-136, February
1998.
[6] J. Christmansson and R. Chillarege,
"Generation of an Error Set that Emulates Software Faults Based on
Field Data," presented at FTCS'26, Sendai, Japan, 1996, pp. 304-313.
[7] M. P. Sullivan, "System Support
for Software Fault Tolerance in Highly Available Database Management Systems,"
Ph.D. Thesis, 1992.
[8] TPC, "TPC Benchmark A, Standard
Specification, Revision 2.0," Transaction Processing Performance Council,
7 June 1994.
Research project supported by Fundação
para a Ciência e Tecnologia - PRAXIS XXI under grant number 2/2.1/TIT/1570/95.
Research supported by Fundação para
a Ciência e Tecnologia - PRAXIS XXI under grant number BD/5636/95.
Authors' contact: Dependable Systems Group, Dep. de Eng.
Informática, University of Coimbra, Polo II, P-3030 Portugal. Phone:
351-39-790000. Fax: 351-39-701266. E-Mail: {dino, henrique, jgabriel}@dei.uc.pt.
URL: http://dsg.dei.uc.pt