By: M. Adams, J.O. Coplien, R. Gamoke, R. Hanmer, F. Keeve, K. Nicodemus
Published in: PLoPD2
Pages: 549-562
Category: Fault-Tolerant Systems, Telecommunications
Summary: Addresses reliability and human factors issues in telecommunications software.
Url: http://www.bell-labs.com/people/cope/patterns/telecom/PLoP95_telecom.html
Addresses reliability and human factors issues in telecommunications software, which must be highly reliable and continuously running.
Pages: 551-552
Downtime, human-induced or otherwise, must be minimized. History has shown that people cause the majority of problems in these systems, so let the machine try to do everything, deferring to a human only as a last resort.
Pages: 552-553
The system must try to recover from all error conditions on its own. To balance automation with human authority and responsibility, allow knowledgeable users to override automatic controls.
Pages: 553-554
The human-machine interface is saturated with error reports. Display a message when taking the first action in a series that could lead to an excessive number of messages. If the abnormal condition ends, display a message that everything is back to normal. Don't display a message for every change in state: people can't do anything about the messages except watch them anyway, so don't bother printing them. This pattern is expanded in Five Minutes of No Escalation Messages [Hanmer+99].
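The reporting rule above can be sketched roughly as follows. This is only an illustration: the class name `QuietReporter`, its method names, and the message wording are assumptions made for this example, and a list stands in for the operator's display.

```python
class QuietReporter:
    """Reports the start and end of an abnormal episode, not every state change.

    Hypothetical sketch; the names and message text are not from the pattern.
    """

    def __init__(self):
        self.in_episode = False
        self.messages = []  # stands in for the human-machine interface

    def on_recovery_action(self, description):
        # Only the first action in a series produces a message.
        if not self.in_episode:
            self.in_episode = True
            self.messages.append(f"ABNORMAL: {description}")

    def on_condition_cleared(self):
        # One message when the abnormal condition ends.
        if self.in_episode:
            self.in_episode = False
            self.messages.append("NORMAL: condition cleared")


reporter = QuietReporter()
for step in ["restart handler", "restart handler", "switch to standby"]:
    reporter.on_recovery_action(step)
reporter.on_condition_cleared()
```

Three recovery actions and one clearing event leave only two messages on the display: one at the start of the episode and one at the end.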
Pages: 554-555
Some errors may be transient. To determine whether a problem will work itself out, don't react immediately to detected conditions. Be sure a condition really exists by checking it several times, perhaps using Leaky Bucket Counters.
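One simple way to check a condition several times before reacting might look like the sketch below. The function name `condition_persists`, the default of three samples, and the optional delay are all assumptions chosen for illustration. The pattern itself does not prescribe them.

```python
import time


def condition_persists(check, samples=3, delay_s=0.0):
    """Return True only if check() reports the condition on every sample.

    Hypothetical helper: `check` is any zero-argument callable that returns
    True while the abnormal condition is observed.
    """
    for i in range(samples):
        if not check():
            return False  # the condition cleared on its own: treat as transient
        if i < samples - 1:
            time.sleep(delay_s)  # wait before looking again
    return True


# A condition that disappears on the second look is ridden over.
readings = iter([True, False, True])
transient = condition_persists(lambda: next(readings))

# A condition that is present on every look is treated as real.
stuck = condition_persists(lambda: True)
```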
Pages: 555-556
To handle transient faults, keep a counter for each failure group. Initialize the counter to a predetermined value. Decrement the counter for each error or event and increment it periodically (but never beyond its initial value). If the leak rate (the decrements caused by errors) is faster than the fill rate (the periodic increments), then an error condition is indicated.
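A minimal sketch of such a counter follows. The class and method names are hypothetical, and treating "the counter has reached zero" as the signal that errors are outpacing the periodic refill is an assumption of this example, not something the summary specifies.

```python
class LeakyBucketCounter:
    """Per-failure-group counter (hypothetical sketch)."""

    def __init__(self, initial=5):
        self.initial = initial  # predetermined starting value (assumed here)
        self.count = initial

    def record_error(self):
        """Decrement on each error; return True once the bucket is empty.

        Assumption: an empty bucket is taken to mean an error condition.
        """
        if self.count > 0:
            self.count -= 1
        return self.count == 0

    def tick(self):
        """Periodic increment, never beyond the initial value."""
        if self.count < self.initial:
            self.count += 1


# A burst of errors with no refill in between drains the bucket.
burst = LeakyBucketCounter(initial=3)
burst_results = [burst.record_error() for _ in range(3)]

# Errors that arrive no faster than the periodic refill never trip it.
slow = LeakyBucketCounter(initial=3)
slow_results = []
for _ in range(10):
    slow_results.append(slow.record_error())
    slow.tick()
```

Different failure groups would presumably use different initial values and refill periods. Those are tuning choices left open here.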
Pages: 557-558
Give the System Integrity Control Program (SICO) the ability and power to reinitialize the system when system sanity is threatened by error conditions. This program should oversee both the initialization process and the normal application functions so initialization can be restarted if it runs into errors.
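The oversight role described above could be sketched as a supervising loop. Everything here is an assumption made for illustration: the function name `supervise`, the use of `RuntimeError` to stand in for a threat to system sanity, and the `max_attempts` limit. It is not a description of how SICO is actually implemented.

```python
def supervise(initialize, run_application, max_attempts=3):
    """Run initialization and then the application, restarting on errors.

    Hypothetical sketch. Returns the attempt number that succeeded, or None
    if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            initialize()       # the supervisor oversees initialization...
            run_application()  # ...and the normal application functions
            return attempt
        except RuntimeError:
            continue  # sanity threatened: reinitialize from the top
    return None


calls = {"init": 0}


def flaky_initialize():
    # Fails on the first attempt only, to show initialization being restarted.
    calls["init"] += 1
    if calls["init"] == 1:
        raise RuntimeError("initialization fault")


attempts_needed = supervise(flaky_initialize, lambda: None)
```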
Pages: 558-560
The central controller has several configurations with many paths through the subsystems depending on the configuration. To select a workable configuration when there is a faulty subsystem, maintain a configuration counter in hardware and a table that maps from that counter to a configuration state. When the system fails to get through a configuration to a predetermined level of stability, it restarts the system with the configuration that corresponds to the next value of the counter.
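The counter-to-configuration table can be illustrated with a short sketch. The table contents, the unit names (`cpu`, `memory`, `bus`), and the function `find_workable_configuration` are invented for this example. In the pattern the counter lives in hardware so that it survives restarts, which a local variable only imitates.

```python
# Hypothetical table: counter value -> which copy of each subsystem to use.
CONFIG_TABLE = [
    {"cpu": 0, "memory": 0, "bus": 0},
    {"cpu": 1, "memory": 0, "bus": 0},
    {"cpu": 0, "memory": 1, "bus": 1},
    {"cpu": 1, "memory": 1, "bus": 1},
]


def find_workable_configuration(reaches_stability, table=CONFIG_TABLE):
    """Step through the table until a configuration reaches stability.

    `reaches_stability` is an assumed callable standing in for "the system
    got through this configuration to the predetermined level of stability".
    """
    counter = 0  # stand-in for the hardware configuration counter
    while counter < len(table):
        config = table[counter]
        if reaches_stability(config):
            return counter, config
        counter += 1  # restart with the configuration for the next value
    return None  # every configuration in the table failed


# Pretend memory unit 0 is faulty: the first two configurations fail.
result = find_workable_configuration(lambda cfg: cfg["memory"] != 0)
```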
Pages: 560-562
You're using Try All Hardware Combos. A latent error can cause a system fault after the configuration counter has been reset. The system then no longer knows that it is in configuration escalation and retries the same configuration that has already failed. The first time the application tells the processor configuration that "all is well," believe it and reset the configuration counter. After that, ignore the request.
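The "believe it once" rule might be sketched as a counter that honors only the first reset request. The class `ConfigurationCounter` and its method names are hypothetical, and the sketch deliberately leaves out when the flag would be cleared again, since the summary does not say.

```python
class ConfigurationCounter:
    """Configuration counter that honors only the first 'all is well'.

    Hypothetical sketch; names are not from the pattern.
    """

    def __init__(self):
        self.value = 0
        self.reset_honored = False

    def advance(self):
        # Move on to the next configuration after a failure.
        self.value += 1

    def all_is_well(self):
        """Reset the counter the first time only; return True if it was reset."""
        if self.reset_honored:
            return False  # later requests are ignored
        self.value = 0
        self.reset_honored = True
        return True


counter = ConfigurationCounter()
counter.advance()
counter.advance()              # two configurations have failed
first = counter.all_is_well()  # believed: the counter is reset
counter.advance()              # a latent error forces escalation again
second = counter.all_is_well() # ignored: the counter keeps its place
```

After the second request the counter still records that escalation is under way, so the configuration that already failed is not retried.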