Fault-Tolerant Telecommunication System Patterns


By: M. Adams, J.O. Coplien, R. Gamoke, R. Hanmer, F. Keeve, K. Nicodemus
Published in: PLoPD2
Pages: 549-562
Category: Fault-Tolerant Systems, Telecommunications

Summary: Addresses reliability and human factors issues in telecommunications software.

Url: http://www.bell-labs.com/people/cope/patterns/telecom/PLoP95_telecom.html

Addresses reliability and human factors issues in telecommunications software, which must be highly reliable and continuously running.

Pattern: Minimize Human Intervention

Pages: 551-552

Downtime, human-induced or otherwise, must be minimized. History has shown that people cause the majority of problems in these systems, so let the machine try to do everything, deferring to a human only as a last resort.

Pattern: People Know Best

Pages: 552-553

The system must try to recover from all error conditions on its own. To balance automation with human authority and responsibility, allow knowledgeable users to override automatic controls.

Pattern: Five Minutes of No Escalation Messages

Pages: 553-554

The human-machine interface is saturated with error reports. Display a message when taking the first action in a series that could lead to an excessive number of messages. If the abnormal condition ends, display a message that everything is back to normal. Don't display a message for every change in state: people can't do anything about the messages except watch them, so don't bother printing them. This pattern is expanded in Five Minutes of No Escalation Messages [Hanmer+99].
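The reporting rule can be sketched as follows. This is a minimal illustration, not code from the pattern itself: the class name EscalationReporter, its method names, and the emit hook are all hypothetical.

```python
class EscalationReporter:
    """Reports the start and the end of an abnormal episode, nothing in between.

    All names here are hypothetical; the pattern does not prescribe an API.
    """

    def __init__(self, emit=print):
        self.emit = emit          # assumed output hook; defaults to print
        self.in_episode = False   # True while an abnormal condition is active
        self.suppressed = 0       # state changes that were not printed

    def action_taken(self, description):
        if not self.in_episode:
            # First action in the series: tell the human once.
            self.in_episode = True
            self.emit(f"ABNORMAL: {description}")
        else:
            # Later state changes are counted but not displayed.
            self.suppressed += 1

    def condition_cleared(self):
        if self.in_episode:
            self.in_episode = False
            self.suppressed = 0
            self.emit("NORMAL: condition cleared")
```

Passing a list's append method as emit makes it easy to check that only the first and last messages of an episode are produced.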

Pattern: Riding Over Transients

Pages: 554-555

Some errors may be transient. To determine if a problem will work itself out, don't react immediately to detected conditions. Be sure a condition really exists by checking it several times, perhaps using Leaky Bucket Counters.
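One simple way to check a condition several times is sketched below. The function name condition_persists, the probe and wait callbacks, and the default of three checks are assumptions made for illustration; the pattern does not fix a number of checks or a delay between them.

```python
def condition_persists(probe, checks=3, wait=lambda: None):
    """Return True only if probe() reports the condition on every sample.

    probe  -- hypothetical callable returning True while the condition exists
    checks -- assumed number of samples to take before reacting
    wait   -- hypothetical hook called between samples (e.g. a short sleep)
    """
    for attempt in range(checks):
        if not probe():
            # The condition went away on its own: treat it as transient.
            return False
        if attempt < checks - 1:
            wait()
    return True
```

A caller would take recovery action only when condition_persists returns True.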

Pattern: Leaky Bucket Counters

Pages: 555-556

To handle transient faults, keep a counter for each failure group. Initialize the counter to a predetermined value. Decrement the counter for each error or event and increment it periodically (but never beyond its initial value). If the leak rate is faster than the fill rate, that is, if errors drain the counter faster than the periodic increments restore it, then an error condition is indicated.
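A minimal sketch of such a counter follows. The class and method names are hypothetical, and treating a count of zero as the point where the error condition is indicated is an assumption; the summary above does not state the threshold.

```python
class LeakyBucketCounter:
    """One counter per failure group (names are illustrative only)."""

    def __init__(self, initial):
        self.initial = initial   # predetermined starting value
        self.count = initial

    def record_error(self):
        """Decrement for an error; return True if the counter is exhausted.

        Assumption: reaching zero is what indicates the error condition.
        """
        if self.count > 0:
            self.count -= 1
        return self.count == 0

    def periodic_tick(self):
        """Increment on a timer, but never beyond the initial value."""
        if self.count < self.initial:
            self.count += 1
```

With an initial value of 3, isolated errors separated by ticks never exhaust the counter, while three errors in a row with no tick in between do.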

Pattern: SICO First and Always

Pages: 557-558

Give the System Integrity Control Program (SICO) the ability and power to reinitialize the system when system sanity is threatened by error conditions. This program should oversee both the initialization process and the normal application functions so initialization can be restarted if it runs into errors.
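The idea of an overseer that can restart initialization can be sketched as below. This is only an illustration under stated assumptions: the function name supervised_init, the representation of initialization as a list of callables, and the max_restarts limit are all hypothetical and are not taken from the pattern.

```python
def supervised_init(steps, max_restarts=3):
    """Run initialization steps in order, restarting from the top on failure.

    steps        -- hypothetical list of zero-argument callables
    max_restarts -- assumed limit on restarts (not part of the pattern text)

    Returns the number of restarts that were needed.
    Raises RuntimeError if initialization never completes.
    """
    for restarts in range(max_restarts + 1):
        try:
            for step in steps:
                step()
            return restarts
        except Exception:
            # A step ran into an error: start initialization over.
            continue
    raise RuntimeError("initialization did not complete")
```

The overseer stays in control throughout, so a failure partway through initialization leads to a fresh attempt rather than a hung system.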

Pattern: Try All Hardware Combos

Pages: 558-560

The central controller has several configurations with many paths through the subsystems depending on the configuration. To select a workable configuration when there is a faulty subsystem, maintain a configuration counter in hardware and a table that maps from that counter to a configuration state. When the system fails to get through a configuration to a predetermined level of stability, it restarts the system with the configuration that corresponds to the next value of the counter.
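The counter-and-table mechanism can be sketched as follows. The table contents, the function name, and the reaches_stability callback are hypothetical. In the pattern the counter lives in hardware and survives a restart; here each loop iteration stands in for one restart.

```python
# Hypothetical table mapping counter values to configuration states.
CONFIG_TABLE = [
    {"cpu": 0, "memory": 0},
    {"cpu": 0, "memory": 1},
    {"cpu": 1, "memory": 0},
    {"cpu": 1, "memory": 1},
]

def find_workable_configuration(reaches_stability, table=CONFIG_TABLE):
    """Step through configurations until one reaches stability.

    reaches_stability -- hypothetical callable taking a configuration and
                         returning True if the system got to the
                         predetermined level of stability with it

    Returns (counter, configuration), or None if every entry failed.
    """
    counter = 0                      # stands in for the hardware counter
    while counter < len(table):
        config = table[counter]
        if reaches_stability(config):
            return counter, config
        counter += 1                 # "restart" with the next configuration
    return None
```

For example, if memory unit 0 is faulty, the first entry fails and the search settles on the second entry, which pairs CPU 0 with memory unit 1.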

Pattern: Fool Me Once

Pages: 560-562

You're using Try All Hardware Combos. A latent error can cause a system fault after the configuration counter has been reset. The system then no longer knows that it is in configuration escalation and retries the same configuration that has already failed. The first time the application tells the processor configuration that "all is well," believe it and reset the configuration counter. After that, ignore the request.
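The "believe it once" rule can be sketched as below. The class and method names are hypothetical. The summary does not say when the counter becomes willing to accept a reset again, so this sketch deliberately leaves that out.

```python
class ConfigurationCounter:
    """Configuration counter that honours only the first reset request."""

    def __init__(self):
        self.value = 0
        self.reset_honoured = False   # True once "all is well" was believed

    def advance(self):
        """Move on to the next configuration after a failed attempt."""
        self.value += 1

    def all_is_well(self):
        """Handle the application's "all is well" request.

        Returns True if the counter was reset, False if the request
        was ignored because an earlier one had already been honoured.
        """
        if self.reset_honoured:
            return False
        self.value = 0
        self.reset_honoured = True
        return True
```

After the first request is honoured, a latent error that triggers further escalation moves the counter forward, and a second "all is well" no longer sends the system back to a configuration that has already failed.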