The Pygmalion Effect in Experimental Dependability Evaluation
- João Carreira, Sara Carvalho, João Gabriel Silva
- Department of Computer Science, University of Coimbra, Portugal
Dependability Evaluation

The evaluation of the dependability properties of a computer system is a
complex task, and the growing complexity of both hardware and software tends
to make this evaluation even more difficult. Analytical modeling of actual
systems is very difficult because the mechanisms involved in fault activation
and error propagation are highly complex and, in most cases, not completely
understood. Experimental evaluation techniques based on fault injection have
over the years become an attractive alternative.
Using this methodology, faults with certain characteristics (which are supposed
to be as close as possible to real faults) are injected into the target system
and their impact is measured. These faults can be injected using several
mechanisms, such as pin-level injection, power supply disturbances, heavy-ion
radiation, simulation, and, more recently, Software Implemented Fault Injection
(SWIFI): a fair spectrum of quite different alternatives. Moreover, these
faults are activated either after a specified amount of time or after a
certain measurable event in the target system, such as processor accesses to
memory, number of instructions executed, pin-level signals, or overall system
load, among others. The level at which faults are injected can also vary
a lot: depending on the technique, faults can be injected at the gate level,
pin level, processor register level, or elsewhere (sometimes it is difficult
to know what is really being affected). Another characteristic of fault
injection experiments is that it is rare to find two studies using the same
target system (and processor): from the old MC68000 to the 21 million
transistor Pentium Pro, there is a handful of results and conclusions.
Concerning the software running in these systems (the payload), it can be as
diverse as parallel scientific applications and control, transport, and
aerospace applications and systems, among many others. Last but not least,
evaluation experiments are usually made with some clear research goals in
mind, e.g. to test specific error-detection mechanisms or exception handlers,
to validate specific fault handling mechanisms, and to allow the estimation of
fault-tolerant system measures such as fault coverage and error latency.

Pygmalion
It is clear that all the factors mentioned above, and the large spectrum
of alternatives, strongly influence the way faults are defined, experiments
are conducted, and results are collected and analysed. So far so good. But
what does Pygmalion have to do with dependability evaluation after all?
Back in the sixties, two American psychologists, Robert Rosenthal and Lenore
Jacobson, published a book entitled Pygmalion in the Classroom
with the results of several interesting experiments. What they attempted
to demonstrate was that, in social life generally and in education
particularly, whenever someone evaluates something, the evaluator is hardly
neutral: the evaluator's expectations concerning the evaluated object (object
is left purposely vague here) influence the evaluation, steering it towards
confirming the initial hypothesis. In their first experiments Rosenthal and Jacobson
used mice. For this purpose, they were aided by two groups of collaborators
(A and B). The psychologists told both groups that the experiment was aimed
at testing mice intelligence: the mice were to be submitted to some tests
and the results recorded. Rosenthal gave each group a set of mice randomly
selected from the laboratory's breeding colony. However, he told group A that
their mice were genetically selected and exceptionally smart (favoring
positive expectations), while he told group B that their mice suffered from
hereditary degeneracy (favoring negative expectations). Surprising as it may
seem, the results showed that the mice of group A performed much better than
those of group B. This example shows that the expectations of the evaluators
can strongly bias the outcome of the evaluation, and that most of the time
the evaluators do not perceive the existence of such expectations in
themselves, nor their enormous power. Rosenthal conducted a similar
experiment in a classroom with students and teachers (omitting the group B
scenario for ethical reasons) and reached the same conclusions. When teachers
expected the students to be more intelligent, the grades actually achieved by
the students corresponded to the teachers' expectations. This unconscious
behavior, which tends to bias evaluators towards their inner expectations, is
known as the Pygmalion effect. As a short historical note,
Pygmalion was a famous sculptor from the island of Cyprus who fell in love
with a statue he had made, Galatea, as if it were a living person, and which
as a consequence came to life (with a little help from Venus).

How it works

One can say that virtually every evaluation process involving human
intervention is prone to the Pygmalion effect, and Computer Science is no
exception. While in the social sciences the Pygmalion effect is the subject
of intensive studies aiming at reducing its negative impact on society,
in dependability evaluation Pygmalion seems to rule. Just give the evaluator
a target system and payload, a myriad of possible fault types, a handful
of techniques, and a clear research goal, and Pygmalion will do the rest.
This may seem only sarcasm (and it is, a bit), but the problem is real. We
are going to show how Pygmalion works with some practical examples. When
conducting dependability evaluation experiments, the framing context of the
evaluator is quite an important factor. It influences the selection of the
target system, fault model, benchmarks, and the categorization of the
outcome, among others.
Let's consider our own group's Xception SWIFI tool (at the risk of shooting
ourselves in the foot). Xception is able to emulate faults in the functional
units of the processor (PowerPC): address bus, data bus, floating-point unit
(FPU), integer unit (IU), etc. Let's now pick a set of fault injection results
obtained with Xception for some benchmark applications (using single-bit
flip faults), and suppose one wants to evaluate the coverage of a processor
built-in error detection mechanism (EDM): the PowerPC data access exception.
In the results obtained for the Matmult application, 21% of data bus faults
were caught by this EDM. On the other hand, this mechanism caught only 3.1%
of the IU faults. The average for all functional units is 10.5%. Furthermore,
if one used the SOR (Successive Over-Relaxation) benchmark, Pygmalion could
say that the EDM caught either 5% (if data bus faults were considered) or
10% (if IU faults were considered).

There are two important issues
here: fault model and benchmarks. First, although all faults are generally
classified as SWIFI single-bit faults (corresponding to single
bit-flip register/memory manipulations), there are immense differences
in their impact. Note that this happens within a single SWIFI tool. What
about faults injected by other tools that also inject SWIFI single-bit
faults? Is there any relation at all between their results and these?
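
To make the spread in the figures quoted above concrete, here is a minimal
Python sketch. It uses only the four per-unit coverage percentages given in
the text for the PowerPC data access exception; the dictionary layout and the
function names (reportable_figures, spread) are illustrative assumptions and
are not part of Xception. The other functional units behind the 10.5% average
are not listed in the text, so they are left out.

```python
# Coverage of the PowerPC data access exception, in percent, as quoted in
# the text. The data layout and helper names are hypothetical.
COVERAGE = {
    "Matmult": {"data bus": 21.0, "IU": 3.1},
    "SOR": {"data bus": 5.0, "IU": 10.0},
}


def reportable_figures(coverage):
    """Return every (benchmark, unit, percent) triple one could quote."""
    return [
        (bench, unit, pct)
        for bench, units in coverage.items()
        for unit, pct in units.items()
    ]


def spread(coverage):
    """Return the lowest and highest figure supported by the same data."""
    figures = reportable_figures(coverage)
    lowest = min(figures, key=lambda f: f[2])
    highest = max(figures, key=lambda f: f[2])
    return lowest, highest


if __name__ == "__main__":
    lowest, highest = spread(COVERAGE)
    print("most pessimistic claim:", lowest)
    print("most optimistic claim:", highest)
    print("ratio between them: %.1f" % (highest[2] / lowest[2]))
```

Depending on which benchmark and functional unit are picked, the same
campaign supports claims that differ by more than a factor of six, which is
the room the text says is left for Pygmalion.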
Second, the differences obtained for two mathematical toy benchmarks show
how significant their impact on the results is. With other, dissimilar
applications, such as those found in control, transportation, and aerospace
systems, the differences may be even larger. To summarize, there is plenty of
room for Pygmalion to play.

Still other common cases: what time scale is considered for intermittent
faults, 5 ms or 100 ns? And what happens when, during fault injection
experiments, faults are discarded, e.g. in the processor prefetch queue? Are
they counted in the final fault injection figures, or just deleted? Of course these
figures, in the form of tables and charts, are quite important. But how
is the experiment outcome classified in practice? One possibility is to
use four classes: Detected, Undetected Wrong Result, Undetected Correct
Result, and System Crash (as Xception does). However, one can also merge
System Crash with Detected, and Undetected Correct Result can encompass
discarded faults, or even faults that were never injected, in one large bag.
Pygmalion usually has an answer to most of these questions. In the end, the
results will match the evaluator's inner expectations.

Any Clues?

Note that this problem is not particular to dependability evaluation.
There are also other areas in computer science where the Pygmalion effect
can play a role, such as performance evaluation. However, the definition
of benchmarks in these fields goes precisely in the direction of diminishing
it, e.g. the SPEC benchmarks for measuring raw processor performance, TPC
for transactional database systems, and SPLASH for shared-memory parallel
programs. Benchmarks for dependability evaluation are not yet a reality
(will they ever be, given the vast spectrum of goals of dependability
studies?). Nonetheless, benchmarks provide a systematic and automated process
for evaluation in order to enable fair and unbiased comparisons. The Pygmalion effect can
show up later, in the analysis and interpretation of results, a task left
to humans. Of course, systems can also be developed to perform well with
specific benchmarks in the first place (something quite different from the
unconscious nature of the Pygmalion effect). But even supposing that there is
no malicious driving of results, can one confidently guarantee that a
well-defined and rigorous process totally eliminates the Pygmalion effect?

In this short abstract we introduced someone who has been playing around
within our research community: Pygmalion. It is our responsibility to
recognize his existence, find where and how he is influencing us negatively,
and do something about it. Any clues?