The management of faults in a small homogeneous network is fairly
straightforward. However, as networks grow in size and complexity, fault
management becomes more problematic and time-consuming. Many different
approaches to telecommunications network fault management currently exist.
They include, among many others, expert systems [Todd, 1988], evidential
theory [Dawes, Altoft, Pagurek, 1995], neural networks [Koivo, 1994], and
model-based approaches [Frank, 1992]. Each of these methods has its own
merits in resolving network faults, but all suffer from weaknesses in
implementation that can limit their effectiveness in telecommunications
network fault management.
In this abstract, we present the intelligent fault management system
for telecommunications networks that we are developing. The system is based
upon the Assumption-based Truth Maintenance System (ATMS) [de Kleer, 1986],
but the ATMS structures have been adapted to support the requirements of
telecommunications network fault management [Liu, 1998]. The design of the
fault management system has been made as generic as possible so that it can
be deployed in a wide range of telecommunications networks with minimal
change to the management system structure.
The general structure of the network fault management system is shown in
Figure 1.
The main features of the system are:
- Simple Network Configuration Database. The network configuration database,
which maintains details of the network devices and the network connection
topology, has a minimalist structure built on the adapted ATMS structures.
Each network node is configured with knowledge of only the immediate
neighbour nodes to which it is directly connected, unlike many ANMs, where
each node has detailed knowledge of the whole network structure. This
minimalist configuration database permits rapid diagnosis of network faults
and easy maintenance of the network configuration.
- High-level control program (HLCP). This program has overall control
of the fault management system and communicates directly with monitor nodes
positioned strategically around the telecommunications network. Each
monitor node oversees a section of the network and is responsible for
monitoring the status and health of all network devices within its section.
The HLCP polls the monitor nodes periodically and receives in return status
messages that together indicate the status of all devices in the network.
When the HLCP receives a status message indicating a fault within the network,
it invokes the fault diagnostic algorithm to act upon the fault.
Diagnostic information received from the fault diagnostic algorithm is
then relayed to the network operator for further action. When several
fault messages are received, the HLCP prioritises the fault reports
in order of importance or urgency and acts on the higher priority faults
first.
- Incorporation of Uncertainty Management. The determination of failure
of a device can rarely be made with total accuracy, especially in large
complex systems. In telecommunication networks, this can be due to a number
of factors such as network congestion which may prevent a device reporting
its status, or unreliability of the network monitor itself which may result
in incorrect status messages being sent to the HLCP. The incorporation
of uncertainty management into the fault management system should permit
the fault diagnostic system to evaluate fault reports against the known
reliability measures that exist for each device and connecting link in
the network. It should thus be possible to generate more intuitive fault
diagnostics that deal more accurately with imprecise fault reports.
- Graphical Interface. A graphical user interface is being developed
through which the network operator will interact with the fault management
system in all aspects of network maintenance and fault management. The
network operator will be able to set system parameters which determine
how the management system responds to network faults, dynamically re-configure
the network to add or remove network nodes, and request clarification of
diagnosis decisions made by the fault diagnostics engine.
- Parallel Implementation. The early prototype version was developed
as a sequential implementation and this provided an ideal test-bed for
the concepts and ideas proposed for the system [Wells, Liu, Adamson, 1998].
This sequential system is now being re-engineered to produce a parallel
version of the fault management system, implemented on a network of INMOS
transputers. The parallel system will handle multiple network faults
concurrently, limited only by the number of processors in the transputer
network. It is anticipated that this will lead to a significant speedup
in the diagnosis of multiple faults in a network.
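The following Python fragment is a minimal sketch of how the neighbour-only
configuration database, the HLCP polling cycle and the use of reliability
measures described above might fit together. It is illustrative only and is
not the ATMS-based implementation: the names NEIGHBOURS, RELIABILITY,
FaultReport, poll, suspects and handle, the node labels, the reliability
values and the convention that a lower number means a more urgent fault are
all assumptions made for this example, and the suspects function is a simple
placeholder for the fault diagnostic algorithm.

```python
import heapq
from dataclasses import dataclass, field

# Hypothetical neighbour-only configuration: each node records only the
# nodes to which it is directly connected.
NEIGHBOURS = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

# Hypothetical reliability measure per device (1.0 = never fails).
RELIABILITY = {"A": 0.99, "B": 0.95, "C": 0.90, "D": 0.98}


@dataclass(order=True)
class FaultReport:
    priority: int                       # assumed: lower value = more urgent
    device: str = field(compare=False)
    monitor: str = field(compare=False)


def poll(monitors):
    """Poll every monitor node (stubbed as a callable) and queue its reports."""
    queue = []
    for name, read_status in monitors.items():
        for device, priority in read_status():
            heapq.heappush(queue, FaultReport(priority, device, name))
    return queue


def suspects(device):
    """Placeholder diagnosis: the reported device and its direct neighbours,
    least reliable first."""
    candidates = [device] + NEIGHBOURS.get(device, [])
    return sorted(candidates, key=lambda node: RELIABILITY[node])


def handle(queue):
    """Act on the queued reports in priority order."""
    handled = []
    while queue:
        report = heapq.heappop(queue)
        handled.append((report.device, suspects(report.device)))
    return handled


# Two monitor nodes, each reporting one fault as (device, priority).
monitors = {
    "M1": lambda: [("B", 2)],
    "M2": lambda: [("D", 1)],
}
result = handle(poll(monitors))
# result == [("D", ["C", "D"]), ("B", ["C", "B", "A"])]
```

In this sketch the more urgent fault on node D is handled before the fault
on node B, and only the direct neighbours of each reported device are
consulted, so no node needs knowledge of the whole network structure.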
References
[Dawes, Altoft, Pagurek, 1995] Dawes, N., Altoft, J., Pagurek, B.,
Network Diagnosis by Reasoning in Nested Evidence Spaces, IEEE
Transactions on Communications, Vol. 32, No. 2/3/4,
February/March/April 1995, pp. 466-76.
[de Kleer, 1986] de Kleer, J., An Assumption-based TMS,
Artificial Intelligence, Vol. 28, 1986, pp. 127-61.
[Frank, 1992] Frank, P.M., Principles of Model-Based Fault Detection,
Proceedings of IFAC Artificial Intelligence in Real-Time Control, 1992,
pp. 213-220.
[Koivo, 1994] Koivo, H. N., Artificial Neural Networks in Fault
Diagnosis and Control, Control Engineering Practice, Vol. 2, No. 1, 1994,
pp. 89-101.
[Liu, 1998] Liu, W., A Domain Independent Data Structure
for Telecommunications Using Adapted ATMS, to appear at IPMU-98, July 1998,
Paris.
[Todd, 1988] Todd, E. Marques, A Symptom-Driven Expert System
for Isolating and Correcting Network Faults, IEEE Communications Magazine,
Vol. 26, No. 3, March 1988.
[Wells, Liu, Adamson, 1998] Wells, N.T., Liu, W., Adamson, K.,
Using the ATMS for Fault Management in Telecommunications Networks, to appear
at IPMU-98, July 1998, Paris.