A European ATM service provider reported an incident that followed a planned intervention on the electrical supply system. During the switch back to mains power, a significant system failure occurred. As a result, various operational systems were unavailable for 1 hour 37 minutes. The event was caused by ageing STS (Static Transfer Switch) components of the electrical supply system.
The incident occurred during planned maintenance and renovation works on the high-tension power network. At regular intervals, interchanges between the public power grid and the emergency power generators were performed. During the last interchange, a power cut with a total duration of approximately 100 ms occurred.
The effects of this event lasted for 1 hour 37 minutes, during which various operational systems were either unavailable or unable to distribute messages.
A detailed analysis of power network measurements, equipment logs and IT-system logs, followed by factory tests of power system components, produced the following findings:
A succession of fluctuations in electrical frequency synchronisation between the power system components led to a slight change from the normal frequency of the electrical power signal.
Factory tests of the STS components showed that this fluctuation exceeded the components' tolerances, which had drifted from the design tolerance as a result of ageing (the components were 10 years old).
In conclusion, the root cause of the unintended power cut was determined to be the STS components erroneously registering a degradation in power quality (a frequency tolerance overshoot), which led to a very short cut in the power supply to operational systems.
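The mechanism described in these findings can be illustrated with a minimal sketch. It assumes that an STS judges power quality by comparing the measured frequency against a tolerance window around the nominal value, and that ageing narrowed the effective window. The class name, the 50 Hz nominal value and all tolerance figures are hypothetical and chosen only for illustration; none of them come from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StsMonitor:
    """Hypothetical model of an STS frequency-quality check."""
    nominal_hz: float
    tolerance_hz: float  # maximum accepted deviation from nominal

    def flags_degradation(self, measured_hz: float) -> bool:
        # The source is judged degraded when the deviation exceeds the tolerance.
        return abs(measured_hz - self.nominal_hz) > self.tolerance_hz


# All values below are illustrative assumptions, not figures from the report.
as_designed = StsMonitor(nominal_hz=50.0, tolerance_hz=0.5)
aged = StsMonitor(nominal_hz=50.0, tolerance_hz=0.2)  # assumed drift due to ageing

fluctuation_hz = 50.3  # slight deviation during resynchronisation

print(as_designed.flags_degradation(fluctuation_hz))  # False
print(aged.flags_degradation(fluctuation_hz))         # True
```

Under these assumptions, the same slight fluctuation is accepted by a component operating at its design tolerance but is flagged as degraded power by the aged component, which is the kind of erroneous detection the findings describe.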
Reported Actions and Lessons Learned
Short-term solution: the automatic transfer of power source between the STS components was temporarily disabled to avoid recurrence of the incident should similar synchronisation fluctuations appear.
Permanent solution: the degraded STS components will be replaced. In addition, the architecture (involving power supply to the entire IT platform via two redundant STS components) and the settings of the entire power supply system will be reviewed and, if found necessary, improved.
Decision making in crisis or system degradation events should be practised to ensure a quick reaction to such critical outages. The prescribed procedures should be made as simple as possible.
All maintenance interventions on power supply systems should be preceded by a formal safety analysis of potential significant operational effects.
Crisis management checklists should be developed to promote consistent and rapid decision making.
Your Attention is Required
ATM Service Providers are invited to note the subject and share their experience with similar cases.