The Pygmalion Effect in Experimental Dependability Evaluation

João Carreira, Sara Carvalho, João Gabriel Silva 1 
Department of Computer Science, University of Coimbra, Portugal

Dependability Evaluation
Evaluating the dependability properties of a computer system is a complex task, and the growing complexity of both hardware and software tends to make it even more difficult. Analytical modeling of actual systems is very difficult because the mechanisms involved in fault activation and error propagation are highly complex and, in most cases, not completely understood. Experimental evaluation techniques based on fault injection have therefore become an attractive alternative over the years.
With this methodology, faults with certain characteristics (supposed to be as close as possible to real faults) are injected into the target system and their impact is measured. These faults can be injected using several mechanisms, such as pin-level injection, power supply disturbances, heavy-ion radiation, simulation, and, more recently, Software Implemented Fault Injection (SWIFI): a fair spectrum of quite different alternatives. Moreover, faults are activated either after a specified amount of time or after a measurable event in the target system, such as processor accesses to memory, number of instructions executed, pin-level signals, or overall system load, among others. The level at which faults are injected can also vary a lot: depending on the technique, faults can be injected at the gate level, pin level, processor register level, or elsewhere (sometimes it is difficult to know what is really being affected). Another characteristic of fault injection experiments is that it is rare to find two studies using the same target system (and processor): from the old MC68000 to the 21-million-transistor Pentium Pro, there is a handful of results and conclusions. The software running on these systems (the payload) can be as diverse as parallel scientific applications and control, transportation, and aerospace applications and systems, among many others.
Last but not least, evaluation experiments are usually conducted with clear research goals in mind, e.g. testing specific error-detection mechanisms or exception handlers, validating specific fault-handling mechanisms, and estimating fault-tolerant system measures such as fault coverage and error latency.

Pygmalion
It is clear that all the factors mentioned above, and the large spectrum of alternatives, strongly influence the way faults are defined, experiments are conducted, and results are collected and analysed. So far so good. But what does Pygmalion have to do with dependability evaluation after all?
Back in the sixties, two American psychologists, Robert Rosenthal and Lenore Jacobson, published a book entitled “Pygmalion in the Classroom” with the results of several interesting experiments. What they attempted to demonstrate was that, in social life generally and in education particularly, whenever someone evaluates something, the evaluator is hardly neutral: the evaluator’s expectations concerning the evaluated object (object is left purposely vague here) influence the evaluation, steering it towards proving the initial hypothesis.
In their first experiments Rosenthal and Jacobson used mice. They were aided by two groups of collaborators (A and B). The psychologists told both groups that the experiment was aimed at testing mice intelligence: the mice were to be submitted to some tests and the results recorded. Rosenthal gave each group a set of mice randomly selected from the laboratory’s stock. However, he told group A that their mice were genetically selected and exceptionally smart (favoring positive expectations), while he told group B that their mice suffered from hereditary degeneracy (favoring negative expectations). Surprising as it may seem, the results showed that the mice of group A performed much better than those of group B. This example shows that the expectations of evaluators can strongly bias the outcome of an evaluation, and that most of the time the evaluators perceive neither the existence of such expectations in themselves nor their enormous power.
Rosenthal conducted a similar experiment in a classroom with students and teachers (omitting the group B scenario for ethical reasons) and reached the same conclusions: when teachers expected the students to be more intelligent, the grades the students actually achieved matched the teachers’ expectations. This unconscious behavior, which biases evaluators towards their inner expectations, is known as the Pygmalion effect. As a short historical note, Pygmalion was, in Greek mythology, a famous sculptor from the island of Cyprus who fell in love with a statue he had made of Galatea, as if it were a living person, and which as a consequence came to life (with a little help from Venus).

How it works
One can say that virtually every evaluation process involving human intervention is prone to the Pygmalion effect, and Computer Science is no exception. While in the social sciences the Pygmalion effect is the subject of intensive study aimed at reducing its negative impact on society, in dependability evaluation Pygmalion seems to rule. Just give the evaluator a target system and payload, a myriad of possible fault types, a handful of techniques, and a clear research goal, and Pygmalion will do the rest. This may seem mere sarcasm (and it is, a bit), but the problem is real. We will show how Pygmalion works with some practical examples.
When conducting dependability evaluation experiments, the framing context of the evaluator is quite an important factor. It influences the selection of the target system, fault model, and benchmarks, and the categorization of the outcome, among others.
Let’s consider our own group’s Xception SWIFI tool (at the risk of shooting ourselves in the foot). Xception is able to emulate faults in the functional units of a processor (PowerPC): address bus, data bus, floating-point unit (FPU), integer unit (IU), etc. Let’s now pick a set of fault injection results obtained with Xception for some benchmark applications (using single-bit flip faults), and suppose one wants to evaluate the coverage of a processor built-in error detection mechanism (EDM): the PowerPC data access exception.
From the results obtained for the Matmult application, 21% of data bus faults were caught by this EDM. On the other hand, the same mechanism caught only 3.1% of the IU faults. The average over all functional units is 10.5%. Furthermore, with the SOR (Successive Over-Relaxation) benchmark, Pygmalion could say that the EDM caught either 5% (if data bus faults were considered) or 10% (if IU faults were considered).
There are two important issues here: Fault Model and Benchmarks.
First, although all faults are generically classified as “SWIFI single-bit faults” (corresponding to single bit-flip manipulations of registers or memory), there are immense differences in their impact. Note that this happens within a single SWIFI tool. What about faults injected by other tools that also inject “SWIFI single-bit faults”? Is there any relation at all between their results and these?
Second, the differences obtained for two mathematical toy benchmarks show how significant their impact on the results is. With other, dissimilar applications, such as those found in control, transportation, and aerospace systems, the differences may be even larger. To summarize, there is plenty of room for Pygmalion to play.
Still other common cases: What period is assumed for intermittent faults: 5 ms or 100 ns? And what happens when, during fault injection experiments, faults are discarded, e.g. in the processor pre-fetch queue? Are they counted in the final fault injection figures, or just deleted? Of course these figures, in the form of tables and charts, are quite important. But how is the experiment outcome classified in practice? One possibility is to use four classes: Detected, Undetected Wrong Result, Undetected Correct Result, and System Crash (as Xception does). However, one can also merge System Crash with Detected, and Undetected Correct Result can become a large bag that also holds discarded faults, or even faults that were never injected. Pygmalion usually has an answer to most of these questions. In the end, the results will match the evaluator’s inner expectations.
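To make the classification question concrete, the following minimal sketch computes a "coverage" figure from one and the same set of outcomes under three bookkeeping conventions. The outcome counts, the class names, and the `coverage` helper are all hypothetical and chosen only for illustration; they are not Xception results and do not reflect any tool's actual interface.

```python
from collections import Counter

# Hypothetical outcome counts for one fault injection campaign.
# These numbers are invented for illustration; they are NOT Xception results.
outcomes = Counter({
    "detected": 300,
    "undetected_wrong_result": 100,
    "undetected_correct_result": 400,
    "system_crash": 100,
    "discarded": 100,  # e.g. faults lost in the processor pre-fetch queue
})


def coverage(counts, detected_classes, excluded_classes=()):
    """Percentage of the counted faults that fall into the 'detected' classes.

    Faults in excluded_classes are dropped from the denominator entirely.
    """
    total = sum(n for cls, n in counts.items() if cls not in excluded_classes)
    hits = sum(counts[cls] for cls in detected_classes)
    return 100.0 * hits / total


# Three defensible conventions applied to the same raw data.
conventions = {
    "detected only, discarded faults counted":
        coverage(outcomes, ["detected"]),
    "detected only, discarded faults dropped":
        coverage(outcomes, ["detected"], ["discarded"]),
    "crash counted as detected, discarded faults dropped":
        coverage(outcomes, ["detected", "system_crash"], ["discarded"]),
}

for name, value in conventions.items():
    print(f"{name}: {value:.1f}%")
```

With these made-up counts the reported figure moves from 30.0% to 33.3% to 44.4% without a single experiment being rerun, which is the kind of freedom the text attributes to Pygmalion. Stating the convention next to every reported figure would be one way to make such choices visible.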

Any Clues?
Note that this problem is not particular to dependability evaluation. There are other areas of computer science where the Pygmalion effect can play a role, such as performance evaluation. However, the definition of benchmarks in these fields goes precisely in the direction of diminishing it, e.g. the SPEC benchmarks for measuring raw processor performance, TPC for transactional database systems, and SPLASH for shared-memory parallel programs. Benchmarks for dependability evaluation are not yet a reality (will they ever be, given the vast spectrum of goals of dependability studies?). Nonetheless, benchmarks provide a systematic and automated evaluation process that enables fair and unbiased comparisons. The Pygmalion effect can still show up later, in the analysis and interpretation of results, a task left to humans. Of course, systems can also be developed to perform well on specific benchmarks in the first place (something quite different from the unconscious nature of the Pygmalion effect). But even supposing that there is no malicious driving of results, can one confidently guarantee that a well-defined and rigorous process totally eliminates the Pygmalion effect?
In this short abstract we have introduced someone who has been playing around within our research community: Pygmalion. It is our responsibility to recognize his existence, find where and how he is influencing us negatively, and do something about it. Any clues?


1. Author contact: Departamento de Engenharia Informática, Universidade de Coimbra. Phone: 351.39.790033. Fax: 351.39.701266. E-Mail: jcar@dei.uc.pt