Delphos: Dependability Evaluation of COTS DBMS

Diamantino Costa, Henrique Madeira, João Gabriel Silva
Universidade de Coimbra – Dep. Eng. Informática
Portugal

Delphos is an ongoing research project that aims to study and evaluate, through fault injection [1], the fault-tolerance techniques available in commercial off-the-shelf (COTS) Database Management Systems (DBMS).

DBMS have traditionally been a field with strong fault-tolerance needs, data integrity and high availability being cornerstone requirements, and the database community has driven research advances in several fault-tolerance techniques (transactions, two-phase commit protocols, data replication). Still, it is generally accepted that the evaluation and validation of those techniques have lacked the knowledge about failure modes that experimental fault injection would provide [2]. The growing use of COTS DBMS technology in mission-critical systems further increases the interest in dependability evaluation of those systems. Indeed, those interested in providing reliability wrappers will need deep knowledge of the faulty behavior of COTS DBMS. Furthermore, database and OS vendors have been delivering cluster and recovery add-on software components that introduce new dimensions to the benchmarking activity. As stated in [3]: "How useful is the fact that a system can perform a zillion tpm or execute a query in milliseconds without some knowledge about its availability and the cost of achieving the appropriate level of performance and availability?"

Delphos has two goals: establishing a methodology for dependability benchmarking of COTS DBMS, and gaining insight into the impact of faults on database applications. The latter will provide a basis for devising defensive programming methodologies that take cost-effective advantage of the built-in fault-tolerance techniques available in DBMS, such as transactions and checkpoints. The former is intended as a start-up initiative towards standard TPC-like (Transaction Processing Performance Council) benchmarks that evaluate the data integrity and availability of COTS DBMS. We are aware that the portability of fault-injection tools is a major hurdle for fault-injection-based evaluation, so we do not claim that a fault-injection tool like the one presented in this framework will be the solution for supporting TPC-like dependability benchmarks. However, we believe that a comprehensive fault-injection study based on fault field data is perhaps the best approach to selecting a high-level fault/error model suitable for those benchmarks.

The following sections elaborate on the main design decisions and tradeoffs made in conceiving the Delphos experimental framework and, where appropriate, also describe the current prototype configuration (see Figure 1). The presentation is broken down into four dimensions: Target System, Faults, Benchmarks and Results.

  • Target System – The target system should be representative of COTS DBMS, so it was decided early on to adhere strictly to commercial products. To the best of our knowledge, no previous experimental fault-injection study has addressed a COTS DBMS. In our current setup the database server is Oracle 7.3 Enterprise Edition running on top of Microsoft Windows NT 4.0 Server. The server hardware platform is a PC based on a 200 MHz Intel Pentium Pro with 128 MB of RAM. The system architecture is a typical client/server configuration with SQL*Net over TCP/IP as middleware. Client load is driven by a TPC-compliant Remote Terminal Emulator (RTE) running on a 150 MHz Pentium PC with Windows NT Workstation 4.0.
  • Faults – Several studies have pointed out the increasing responsibility of software for DBMS failures, particularly on fault-tolerant machines [7]. However, with COTS hardware it is expected that hardware faults still account for a significant share of system failures. The fault set will therefore include both transient hardware faults and software faults (i.e., bugs). Faults are injected in the database processes/threads only; no attempt is made to mimic client or network faults. Although it is fairly easy to inject faults in middleware and client components, we are more concerned with server faults since they represent the biggest impairment to system dependability [4]. The fault-injection tool is a SWIFI (software-implemented fault injection) tool, XceptionNT [5], which takes advantage of the Pentium processor's debugging support to inject faults with minimal intrusion. The first set of experiments will focus on hardware faults only, but the adaptation of XceptionNT to inject software faults as suggested in [6] and [7] is underway.
  • Benchmarks – Among DBMS applications, OLTP clearly stands out as the most demanding in terms of dependability; since TPC-C is a de facto standard for performance evaluation of OLTP systems, it was a natural choice. The standard TPC-C setup has been augmented with a new entity, the Database Administrator, and several other enhancements were made to cope with the presence of new inputs, namely faults and failures. The proposed marriage of a performance benchmark with a fault-injection experiment enables a straightforward evaluation of the tradeoffs between performance and dependability, which is just what most COTS DBMS purchasers would like to have. For the sake of simplicity we have selected the older TPC-A [8] for the first testbed prototype (a TPC-C implementation is underway and will constitute the benchmark of the final prototype).
  • Results – We are naturally interested in acquiring figures on the impact of faults in terms of data integrity and availability. Data integrity is addressed at three levels: 1) at the application level, using the set of semantic rules and consistency tests already specified for TPC benchmarks; 2) at the referential integrity level, matching the relational database rules (data dictionary) against the data actually stored; 3) at the file level, where database data files and log files are checked for the integrity of their internal data structures. Availability is assessed through 1) the database mean time to recovery as seen by the end user, 2) the type of database recovery needed (partial or total) and 3) the recovery effort (automatic or manual, i.e., assisted by the database administrator). These elaborate results (and many others) are made possible by collecting data at different levels, from the machine level to the OS, the DBMS and the application. The evaluation environment offers machine-level readouts, including the precise location of the injected fault. At the OS level, exceptions, crash and hang conditions, and return codes of database processes are also logged. At the application level, TPC-A activity is logged in order to pinpoint delayed and lost transactions.
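As an illustration of the application-level integrity tests mentioned under Results, the following Python sketch checks a simplified consistency condition on a TPC-A-like bank schema. It is not the Delphos implementation: the table and column names, the use of SQLite instead of Oracle, and the assumption that all balances start at zero are hypothetical choices made only to keep the example self-contained.

```python
import sqlite3

def build_demo_db():
    """Create a tiny in-memory bank database (hypothetical schema, zero initial balances)."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE branch  (b_id INTEGER PRIMARY KEY, balance INTEGER NOT NULL);
        CREATE TABLE teller  (t_id INTEGER PRIMARY KEY, b_id INTEGER NOT NULL, balance INTEGER NOT NULL);
        CREATE TABLE account (a_id INTEGER PRIMARY KEY, b_id INTEGER NOT NULL, balance INTEGER NOT NULL);
        CREATE TABLE history (a_id INTEGER, t_id INTEGER, b_id INTEGER, delta INTEGER);
        INSERT INTO branch  VALUES (1, 0);
        INSERT INTO teller  VALUES (1, 1, 0), (2, 1, 0);
        INSERT INTO account VALUES (1, 1, 0), (2, 1, 0);
    """)
    return db

def apply_transaction(db, a_id, t_id, b_id, delta):
    """Apply one deposit/withdrawal atomically: commit on success, roll back on error."""
    with db:
        db.execute("UPDATE account SET balance = balance + ? WHERE a_id = ?", (delta, a_id))
        db.execute("UPDATE teller  SET balance = balance + ? WHERE t_id = ?", (delta, t_id))
        db.execute("UPDATE branch  SET balance = balance + ? WHERE b_id = ?", (delta, b_id))
        db.execute("INSERT INTO history VALUES (?, ?, ?, ?)", (a_id, t_id, b_id, delta))

def check_consistency(db):
    """Return a list of violated conditions; an empty list means none was detected."""
    def total(sql):
        return db.execute(sql).fetchone()[0]

    sums = {
        "account": total("SELECT COALESCE(SUM(balance), 0) FROM account"),
        "teller":  total("SELECT COALESCE(SUM(balance), 0) FROM teller"),
        "branch":  total("SELECT COALESCE(SUM(balance), 0) FROM branch"),
        "history": total("SELECT COALESCE(SUM(delta), 0) FROM history"),
    }
    violations = []
    # Condition 1: all global totals must agree (valid because balances start at zero).
    if len(set(sums.values())) != 1:
        violations.append(f"balance totals disagree: {sums}")
    # Condition 2: within each branch, the teller balances must add up to the branch balance.
    rows = db.execute("""
        SELECT b.b_id, b.balance, COALESCE(SUM(t.balance), 0)
        FROM branch b LEFT JOIN teller t ON t.b_id = b.b_id
        GROUP BY b.b_id, b.balance
    """).fetchall()
    for b_id, branch_balance, teller_total in rows:
        if branch_balance != teller_total:
            violations.append(
                f"branch {b_id}: balance {branch_balance} != teller total {teller_total}")
    return violations

db = build_demo_db()
apply_transaction(db, a_id=1, t_id=1, b_id=1, delta=100)
print(check_consistency(db))  # an empty list means no violation was detected
```

In a fault-injection campaign, a check of this kind would be run after each experiment once the database has recovered, so that any silent corruption of committed data shows up as a non-empty list of violations.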

    Figure 1 - Testbed Layout
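The availability measures listed under Results (mean time to recovery as seen by the end user, recovery type and recovery effort) could be aggregated from per-experiment logs along the lines of the Python sketch below. The record layout, field names and category labels are assumptions introduced for illustration, and the sample values in the usage lines are made up; they are not measured Delphos results.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional

@dataclass
class ExperimentRecord:
    """One fault-injection experiment (hypothetical log format)."""
    fault_id: str
    failure_time: Optional[float]   # seconds into the run when clients lost service; None if no outage
    recovery_time: Optional[float]  # seconds into the run when clients saw service again; None if never
    recovery_scope: str             # "none", "partial" or "total"
    recovery_effort: str            # "none", "automatic" or "manual"

def summarize(records):
    """Aggregate availability measures over a list of experiment records."""
    outages = [r for r in records if r.failure_time is not None]
    recovered = [r for r in outages if r.recovery_time is not None]
    durations = [r.recovery_time - r.failure_time for r in recovered]
    return {
        "experiments": len(records),
        "outages": len(outages),
        "unrecovered": len(outages) - len(recovered),
        # Mean time to recovery over recovered outages only; None if there were none.
        "mean_time_to_recovery": mean(durations) if durations else None,
        "by_scope": Counter(r.recovery_scope for r in records),
        "by_effort": Counter(r.recovery_effort for r in records),
    }

# Illustrative, made-up records: one fault with no effect, two with outages.
sample = [
    ExperimentRecord("f1", None, None, "none", "none"),
    ExperimentRecord("f2", 10.0, 40.0, "partial", "automatic"),
    ExperimentRecord("f3", 5.0, 95.0, "total", "manual"),
]
summary = summarize(sample)
print(summary["mean_time_to_recovery"], dict(summary["by_effort"]))
```

Keeping the raw failure and recovery timestamps, rather than only the derived durations, makes it possible to correlate each outage with the machine-level and OS-level logs collected for the same experiment.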

We have briefly introduced Delphos, a research project for the experimental evaluation of the dependability of COTS DBMS. The experimental framework combines a minimally intrusive fault-injection tool with standard benchmarks and a COTS OS and platform to deliver results that are as realistic as possible. The first prototype, which consists of Oracle 7.3 Enterprise Edition on top of a Wintel platform, is now entering testing, and we expect preliminary results in the near future.

References

[1] J. Arlat, "Fault injection for the experimental validation of fault-tolerant systems," IEICE Workshop on Fault-Tolerant Systems, Kyoto, Japan, 18-19 June 1992, pp. 33-40.

[2] M. T. Ozsu and P. Valduriez, "Distributed Database Systems: Where are we now?," IEEE Computer, vol. 24, 1991.

[3] D. Brock, "A Recommendation for High-Availability Options in TPC Benchmarks," Data General, http://www.tpc.org/articles/HA.html

[4] A. Wood, "Predicting Client/Server Availability," IEEE Computer, vol. 28, pp. 41-48, 1995.

[5] J. Carreira, H. Madeira, J. Silva, "Xception: A technique for the evaluation of dependability in modern Computers," IEEE Transactions on Software Engineering, vol. 24, no. 2, pp. 125-136, February 1998.

[6] J. Christmansson and R. Chillarege, "Generation of an Error Set that Emulates Software Faults Based on Field Data," presented at FTCS'26, Sendai, Japan, 1996, pp. 304-313.

[7] M. P. Sullivan, "System Support for Software Fault Tolerance in Highly Available Database Management Systems," Ph.D. Thesis, 1992.

[8] TPC, "TPC Benchmark A, Standard Specification, Revision 2.0," Transaction Processing Performance Council, 7 June 1994.


Research project supported by Fundação para a Ciência e Tecnologia - PRAXIS XXI under grant number 2/2.1/TIT/1570/95.

Research supported by Fundação para a Ciência e Tecnologia - PRAXIS XXI under grant number BD/5636/95.

Authors' contact: Dependable Systems Group, Dep. de Eng. Informática, University of Coimbra, Polo II, P-3030 Portugal. Phone: 351-39-790000. Fax: 351-39-701266. E-Mail: {dino, henrique, jgabriel}@dei.uc.pt. URL: http://dsg.dei.uc.pt