Bob Bowman

Hardware availability is a function of hardware reliability and of the maintainability and serviceability features of that hardware. The principal way to achieve availability goals is to first achieve the reliability goals; this is the most cost-effective approach for both the producer and the customer. However, even with the best design and manufacturing practices, random failures will still occur in complex systems, and in a simple non-redundant configuration some of them will result in system outages. Service then becomes extremely important in limiting system downtime.
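The dependence of availability on both reliability and repair can be made concrete with the standard steady-state relation, availability = MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to repair. The following is a minimal sketch only; the function name and the sample figures are illustrative assumptions, not values from any particular system.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return the long-run fraction of time a repairable system is up.

    Uses the standard steady-state relation MTBF / (MTBF + MTTR).
    """
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    if mttr_hours < 0:
        raise ValueError("MTTR cannot be negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical (MTBF, MTTR) pairs in hours, chosen only for illustration.
    for mtbf, mttr in [(10_000, 4), (20_000, 4), (10_000, 1)]:
        a = steady_state_availability(mtbf, mttr)
        print(f"MTBF={mtbf} h, MTTR={mttr} h -> availability={a:.6f}")
```

Both levers appear in the formula: raising MTBF (reliability) and lowering MTTR (service) each move the result toward 1.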

Again, the key to achieving high availability is to increase the reliability of the system as much as is reasonably possible in a cost-effective manner. This requires a two-phase development process. The first phase aims at fault avoidance (failure and error prevention) through reduced design complexity, use of higher-grade components, reduced stress levels on components, proper tolerancing of all critical parameters, and the elimination of all significant (critical) single points of failure. This phase yields the best possible basic non-redundant system. Many systems today, however, require availability near 100%, so a second phase is needed to further increase reliability by introducing selective fault-tolerant features. Fault tolerance is the capability of a system to perform its function according to design specifications in the presence of failures and errors. The principal way of achieving a fault-tolerant design is dynamic redundancy, which can take the form of repeated execution or of physical replication of hardware that provides an alternate path for successful operation.
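The two forms of dynamic redundancy named above, repeated execution and an alternate path, can be sketched in software terms as follows. This is an illustrative analogy under stated assumptions rather than a hardware design: `run_with_redundancy`, `AllPathsFailed`, and the callables standing in for a primary and a standby unit are all hypothetical names introduced here.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllPathsFailed(Exception):
    """Raised when every redundant path has been tried and none succeeded."""


def run_with_redundancy(
    paths: Sequence[Callable[[], T]],
    retries_per_path: int = 1,
) -> T:
    """Try each path in order, retrying it before failing over to the next.

    Retrying models repeated execution (useful against transient faults);
    moving to the next callable models switching to a replicated unit.
    """
    last_error: Optional[Exception] = None
    for path in paths:
        # One initial attempt plus the configured number of retries.
        for _ in range(1 + retries_per_path):
            try:
                return path()
            except Exception as exc:  # a real design would detect specific faults
                last_error = exc
    raise AllPathsFailed("no redundant path succeeded") from last_error


if __name__ == "__main__":
    def primary() -> str:
        raise RuntimeError("hard fault in primary unit")

    def standby() -> str:
        return "served by standby unit"

    print(run_with_redundancy([primary, standby]))
```

In this sketch the caller never sees the primary's failure, which is the sense in which fault tolerance shields the customer from an outage.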

Maintainability and serviceability features (both software and hardware) are also added to the system to minimize the downtime resulting from any system outage. These are the outages caused by failures that fault tolerance does not shield from the customer. This last line of defense is critical: if these features do their job, any resulting downtime is kept to a minimum. Keeping the system mean time to repair (MTTR) short is the final key to achieving highly available systems.
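To illustrate why a short MTTR matters, the sketch below estimates expected downtime per year from the long-run unavailability MTTR / (MTBF + MTTR), where MTBF is the mean time between failures. The function name, the 8,760-hour year, and the sample figures are assumptions made for illustration only.

```python
HOURS_PER_YEAR = 8760.0  # assumes a 365-day year of continuous operation


def expected_annual_downtime_hours(mtbf_hours: float, mttr_hours: float) -> float:
    """Estimate yearly downtime from long-run unavailability MTTR / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    unavailability = mttr_hours / (mtbf_hours + mttr_hours)
    return HOURS_PER_YEAR * unavailability


if __name__ == "__main__":
    mtbf = 5_000.0  # hypothetical mean time between failures, in hours
    for mttr in (8.0, 4.0, 1.0):  # hypothetical repair times, in hours
        downtime = expected_annual_downtime_hours(mtbf, mttr)
        print(f"MTTR={mttr:.0f} h -> about {downtime:.2f} h of downtime per year")
```

When MTTR is small relative to MTBF, the estimate scales almost linearly with MTTR, so halving the repair time roughly halves the yearly downtime.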