Enhancing Survivability of Critical Information Systems
John C. Knight, Kevin Sullivan, Xing Du,
Chenxi Wang, Matt Elder, Ray W. Lubinsky
Department of Computer Science, University of Virginia
Charlottesville, VA 22903-2442
{knight, sullivan, xd2a, cw2e, mce7e, rwl}@cs.virginia.edu
John McHugh
Department of Computer Science, Portland State University
Portland, OR 97201
mchugh@cs.pdx.edu
Many large information systems have evolved to the point where
organizations rely heavily upon them. In some cases, such systems
are so widespread and so important that the normal activities of
society depend upon their continued operation. We refer to these
systems as critical information systems. Examples include banking
and finance, transportation, and medical services [1]. There is
a need to improve
the survivability of critical information systems given the increasing
dependence on them, the serious consequences of their failure, and
their demonstrated fragility and vulnerability. The survivability
of a system is defined as the ability of the system to continue
to provide service (possibly degraded) when various changes occur
in the operating environment.
Most critical information systems are built from legacy software,
Commercial-Off-The-Shelf (COTS) components, or both. They are
large-scale distributed systems consisting of hundreds of computers
spread nationally or even globally and connected by wide area
networks. Our Guardian project [2] aims to enhance the survivability
of such systems so that they tolerate failures and correlated
malicious attacks, while minimizing modification of the existing
systems. The approach we adopt is characterized by:
- A sensor- and actuator-driven control system framework. To minimize
modification of existing systems and to provide flexible,
relatively application-independent survivability management, we
adopt a control system framework (Figure 1). It consists of
two major parts: the sensors and actuators, and the control system.
Sensors collect failure, attack, performance, and other information
about the critical system, and actuators execute functions that
exert control over the critical system to change its operation.
The control system obtains the required information from the sensors,
makes decisions based on the survivability requirements of the
system, and directs the actuators to carry those decisions out.
Sensors and actuators are implemented as shells around the existing
systems, requiring minimal modification. A shell is a layer of
software that logically surrounds a software artifact and either
enforces some useful predicates on the state of the system of which
the artifact is a component or supplements the functionality of
the artifact in some crucial way.
Figure 1. System framework.
- Survivability specification. Critical systems are viewed and
analyzed from a service perspective, and the services are classified
quantitatively based on their interdependence and their criticality
to the overall function of the system. Failures and attacks are
assessed based on their kind, the number of simultaneous occurrences,
and the places where they occur. The survivability requirements
are expressed as a set of predicates that state which services
must survive under which kinds of failures and attacks.
- Adaptive and dynamic service survivability management. Survivability
management is the process that keeps the survivability requirements
satisfied in the face of changes in the system. The process
is composed of four phases: (1) Change Detection. It detects changes
(e.g., failures and attacks) that have occurred in the system.
(2) Damage Assessment. It assesses the damage caused by the changes.
(3) System Adjustment. It isolates the failed services, activates
queuing mechanisms to buffer service requests directed to these
failed services, and slows the processing pace of the whole system
so that the next phase has sufficient time to act before the whole
system crashes. (4) System Restoration and Adaptation.
Based on the damage to the system and applications and on the
availability of system resources, the control system decides whether
the damaged services can be restored or whether overall service
must be reduced. It determines which functions will continue to
be provided, and switches the application from one design
configuration to another according to the survivability requirements.
Based on the history of changes that have occurred in the system,
it also changes the design configurations of services adaptively
to provide more survivability for a given amount of system
resources. Adaptation appears in a second way as well: the four
phases above may be applied differently to different services,
depending on the criticality of the failed services.
- Security issues. Securing the control system and the sensors
and actuators is a critical issue that we address. The control
system framework must not itself introduce new vulnerabilities.
The control system adopts a hierarchical distributed structure
to avoid a single point of failure and to improve system performance.
Authentication is employed to identify control system components,
and the data transferred between them is protected by encryption.
The control system runs on separate computers, but the sensors
and actuators must reside on the same machine as the application.
We are investigating methods to secure the sensors and actuators
in such an environment.
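As an illustration only, the four management phases and the
predicate-style requirements described above can be sketched in a
few lines of Python. This is not the Guardian implementation. Every
name here (ServiceShell, Requirement, ControlSystem, restart_budget)
is hypothetical, the requirement form "with at most k failed
services, these services must stay up" is an assumed simplification
of the survivability predicates, and a simple restart budget stands
in for the available system resources.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServiceShell:
    """Hypothetical shell around one service: a sensor plus a few actuators."""
    name: str
    criticality: int                 # assumed scale: larger means more critical
    healthy: bool = True
    isolated: bool = False
    pending: deque = field(default_factory=deque)

    def sense(self) -> bool:
        """Sensor: report whether the wrapped service is currently healthy."""
        return self.healthy

    def isolate(self) -> None:
        """Actuator: take the service out of the request path."""
        self.isolated = True

    def handle(self, request: str) -> str:
        """Buffer requests while isolated; otherwise pass them through."""
        if self.isolated:
            self.pending.append(request)
            return "queued"
        return "served"

    def restore(self) -> List[str]:
        """Actuator: bring the service back and hand over buffered requests."""
        self.healthy, self.isolated = True, False
        drained = list(self.pending)
        self.pending.clear()
        return drained


@dataclass(frozen=True)
class Requirement:
    """Assumed predicate form: with at most `max_failures` failed services,
    every service in `must_survive` has to remain available."""
    max_failures: int
    must_survive: frozenset


class ControlSystem:
    def __init__(self, shells: List[ServiceShell],
                 requirements: List[Requirement], restart_budget: int) -> None:
        self.shells: Dict[str, ServiceShell] = {s.name: s for s in shells}
        self.requirements = requirements
        self.restart_budget = restart_budget   # stand-in for available resources

    def detect(self) -> Set[str]:
        """Phase 1, change detection: poll every sensor."""
        return {n for n, s in self.shells.items() if not s.sense()}

    def assess(self, failed: Set[str]) -> Set[str]:
        """Phase 2, damage assessment: which services must survive this damage?"""
        required: Set[str] = set()
        for req in self.requirements:
            if len(failed) <= req.max_failures:
                required |= req.must_survive
        return required

    def step(self) -> Dict[str, List[str]]:
        failed = self.detect()
        required = self.assess(failed)
        for name in failed:                    # Phase 3, system adjustment
            self.shells[name].isolate()
        # Phase 4, restoration: required services first, most critical first,
        # limited by the restart budget; the rest stay isolated (degraded).
        candidates = sorted(failed & required,
                            key=lambda n: (-self.shells[n].criticality, n))
        restored = candidates[:self.restart_budget]
        for name in restored:
            self.shells[name].restore()
        degraded = sorted(failed - set(restored))
        return {"failed": sorted(failed), "restored": restored,
                "degraded": degraded}
```

In this sketch a service that is not restored stays isolated, so
its shell keeps buffering requests until a later step restores it.
The adaptive part of phase 4 (changing design configurations based
on the history of changes) is deliberately left out.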
Many research challenges remain, including, for example:
(1) Generation of control systems from specifications,
(2) Scalability of control system architectures, and
(3) High assurance systems on the Internet.
We are now studying the features of critical information systems,
analyzing the feasibility of the approach, proposing a systematic
way to use the approach, and applying it in a prototype system.
The prototype emulates the payment system of the US banking system,
with its survivability enhanced by our control system and
sensor/actuator framework. Even though its functions are limited,
it shows the potential of the approach and provides a testbed for
significant follow-on research.
References
[1] J. C. Knight, M. C. Elder, J. Flinn, and P. Marx, Summaries
of Three Critical Infrastructure Applications, Technical Report
CS-97-27, Department of Computer Science, University of Virginia,
December 1997.
[2] J. C. Knight, R. W. Lubinsky, J. McHugh, and K. J. Sullivan,
Architectural Approaches to Information Survivability, Technical
Report CS-97-25, Department of Computer Science, University of
Virginia, September 1997.