1. Introduction
Random hardware failures in safety-related systems are typically caused by component wear, degradation, or aging. To effectively control such failures, a combination of reliability engineering techniques must be applied. This article outlines six key strategies to manage random failures and improve the Safety Integrity Level (SIL) of safety-related electrical, electronic, or programmable electronic (E/E/PE) systems in compliance with IEC 61508.
2. Key Strategies for Failure Control
2.1 Use Components with Verified Reliability Data
A reliable safety system must be built using components with documented reliability data. Only with such data can the system failure rate be quantified and controlled. Examples of certified safety components include:
Safety PLCs (Programmable Logic Controllers) that shut off outputs upon internal/external faults
Safety controllers
Safety-rated fieldbus systems (e.g., PROFIsafe, ASi Safety)
Using pre-certified safety elements simplifies system-level SIL calculation and makes failure behavior more predictable.
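As a rough illustration of how documented failure rates feed a system-level figure, the sketch below sums the dangerous undetected failure rates of a simple sensor, logic, and actuator chain. The class name, function name, and all rate values are hypothetical and chosen only for illustration; real values would come from the component's safety manual or certificate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    lambda_du_per_hour: float  # dangerous undetected failure rate (assumed value)


def series_failure_rate(components):
    """Sum the rates: in a series chain, any single failure defeats the function."""
    return sum(c.lambda_du_per_hour for c in components)


# Illustrative numbers only, not taken from any datasheet.
chain = [
    Component("sensor", 2.0e-7),
    Component("safety PLC", 5.0e-8),
    Component("valve", 7.5e-7),
]

total = series_failure_rate(chain)
print(f"Chain dangerous undetected failure rate: {total:.2e} per hour")
```

Summing rates in this way assumes constant failure rates and a chain with no redundancy; it is a starting point, not a full SIL verification.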
2.2 Apply Redundant Design
Redundancy refers to the incorporation of extra capacity or components beyond what is strictly necessary for system operation. Types of redundancy include:
Functional redundancy: Parallel safety channels or subsystems
Capacity redundancy: Operating components at only two-thirds of their rated capacity
This design approach enhances system availability and mitigates failure consequences, particularly in high-demand applications.
2.3 Improve Fault Tolerance
Fault tolerance refers to a system’s ability to continue performing its intended function in the presence of a failure. Consider the following cases:
Without Fault Tolerance
If a system consists of components A, B, and C, and any single component failure causes total system failure (a series, or 3oo3, arrangement), the fault tolerance is zero.
With Fault Tolerance
If the system is redesigned such that only two out of three components are needed (2oo3 logic), it can still function correctly even if one component fails, which raises the fault tolerance to 1.
Diagram Suggestion: Compare 3oo3 vs. 2oo3 architecture with logic gates (AND/OR) to visually demonstrate fault margin.
Under the architectural constraints of IEC 61508-2, raising the hardware fault tolerance by one can raise the maximum claimable SIL by one level (e.g., SIL1 → SIL2), provided the safe failure fraction and the failure probability target also support it.
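The A, B, C example can be sketched as an M-out-of-N (MooN) vote, where M healthy channels are needed for the function to be performed. The function names below are hypothetical; the relation "fault tolerance = N - M" is the usual reading of MooN notation.

```python
def moon_works(m, healthy):
    """Return True if at least m channels in `healthy` are still working."""
    return sum(healthy) >= m


def hardware_fault_tolerance(m, n):
    """Number of channel failures an MooN architecture can survive: N - M."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    return n - m


# Channel C has failed; A and B are healthy.
one_failed = [True, True, False]

series_ok = moon_works(3, one_failed)  # all three needed (series A-B-C)
voted_ok = moon_works(2, one_failed)   # only two of three needed (2oo3)

print("3 of 3 needed:", series_ok, "fault tolerance", hardware_fault_tolerance(3, 3))
print("2 of 3 needed:", voted_ok, "fault tolerance", hardware_fault_tolerance(2, 3))
```

With one channel failed, the series arrangement no longer works while the 2oo3 arrangement still does.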
2.4 Increase Diagnostic Coverage
Diagnostic Coverage (DC) is the proportion of dangerous failures detected by automatic diagnostics. DC is crucial for safety systems that remain dormant for long periods and only act when a demand occurs. Key points:
High DC improves the probability of timely fault detection
Detected faults can be repaired before a real demand occurs
For redundant safety architectures, simultaneous undetected failures of all channels are rare but must still be addressed
Formula:
DC (%) = (Dangerous Detected Failures / Total Dangerous Failures) × 100%, where Total Dangerous Failures = Dangerous Detected + Dangerous Undetected
A higher DC leaves fewer dangerous failures undetected, which lowers the probability that the safety function is unavailable on demand.
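The DC formula can be evaluated directly from the dangerous detected and dangerous undetected failure rates. The sketch below uses hypothetical function names and illustrative rates; the band limits (60%, 90%, 99%) are the DC ranges commonly cited from IEC 61508-2 and should be checked against the standard before use.

```python
def diagnostic_coverage(lambda_dd, lambda_du):
    """DC as a fraction: detected dangerous rate over total dangerous rate."""
    total_dangerous = lambda_dd + lambda_du
    if total_dangerous <= 0:
        raise ValueError("total dangerous failure rate must be positive")
    return lambda_dd / total_dangerous


def dc_band(dc):
    """Map a DC fraction to the none/low/medium/high bands (assumed limits)."""
    if dc < 0.60:
        return "none"
    if dc < 0.90:
        return "low"
    if dc < 0.99:
        return "medium"
    return "high"


# Illustrative rates per hour, not taken from any datasheet.
dc = diagnostic_coverage(lambda_dd=9.5e-7, lambda_du=0.5e-7)
print(f"DC = {dc * 100:.1f}% ({dc_band(dc)})")
```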
2.5 Perform Regular Proof Testing
Proof testing is a periodic manual or automated test performed to reveal undetected dangerous failures. Its role includes:
Restoring the system to its baseline safety state
Ensuring hidden faults (especially in low-demand systems) do not accumulate
Supporting compliance with SIL targets for PFD (Probability of Dangerous Failure on Demand)
Diagram Suggestion: A PFD vs. Time curve showing a "sawtooth" pattern, in which PFD increases over time until a proof test reduces it sharply.
Proof testing is vital for dormant protection systems that rarely activate (e.g., emergency shutoff). It is often scheduled during plant shutdowns or major maintenance events.
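The sawtooth behavior can be sketched for a single channel (1oo1) under simplifying assumptions: a constant dangerous undetected failure rate, a perfect proof test that finds every hidden fault, and negligible test and repair time. The function names and numeric values are hypothetical; the average uses the common first-order approximation PFDavg ≈ λDU × TI / 2.

```python
import math


def pfd_at(t_hours, lambda_du, test_interval_hours):
    """Instantaneous PFD: grows since the last proof test, then resets to zero."""
    time_since_test = t_hours % test_interval_hours
    return 1.0 - math.exp(-lambda_du * time_since_test)


def pfd_avg_1oo1(lambda_du, test_interval_hours):
    """First-order average PFD over one proof test interval (1oo1 channel)."""
    return lambda_du * test_interval_hours / 2.0


LAMBDA_DU = 1.0e-6  # dangerous undetected failures per hour (assumed)
TI = 8760.0         # proof test interval in hours, one year (assumed)

# Sample the curve just before and just after the first proof test.
before_test = pfd_at(TI - 1.0, LAMBDA_DU, TI)
after_test = pfd_at(TI + 1.0, LAMBDA_DU, TI)
average = pfd_avg_1oo1(LAMBDA_DU, TI)

print(f"PFD just before proof test: {before_test:.2e}")
print(f"PFD just after proof test:  {after_test:.2e}")
print(f"Approximate PFDavg:         {average:.2e}")
```

With these assumed values the average comes to roughly 4.4e-3. Halving the proof test interval halves the approximate average, which is why the interval is a primary lever for meeting a PFD target.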
2.6 Leverage Field Experience and Proven-in-Use Data
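Before IEC 61508, safety devices generally had to accumulate substantial field operation hours before qualification. Under IEC 61508, two kinds of evidence are accepted:
Proven in use: documented operating experience with the same device in a comparable application
Assessment against the standard: new or programmable components verified through functional safety assessments
IEC 61508-2 additionally classifies elements as Type A (simple, with well-defined failure modes and dependable failure data) or Type B (complex, such as programmable devices). This classification determines which architectural constraints apply and is independent of how the evidence was obtained.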
Utilizing devices with operational history reduces uncertainty in failure rate prediction and accelerates safety certification.
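One way field data reduces uncertainty is by supporting a failure rate estimate from accumulated operating hours. The sketch below assumes a constant failure rate and complete failure reporting; the function names, the fleet figures, and the 70% confidence default are assumptions for illustration. For the zero-failure case, the upper bound follows from requiring that the probability of observing no failures, exp(-λT), equals 1 minus the confidence level.

```python
import math


def point_estimate(failures, device_hours):
    """Simple failure rate estimate: observed failures per cumulative device-hour."""
    if device_hours <= 0:
        raise ValueError("device_hours must be positive")
    return failures / device_hours


def upper_bound_zero_failures(device_hours, confidence=0.7):
    """One-sided upper bound on the failure rate when no failures were observed."""
    if device_hours <= 0 or not 0.0 < confidence < 1.0:
        raise ValueError("need device_hours > 0 and 0 < confidence < 1")
    return -math.log(1.0 - confidence) / device_hours


# Illustrative fleet data, not from any real installation.
rate = point_estimate(failures=2, device_hours=1.0e7)
bound = upper_bound_zero_failures(device_hours=1.0e6, confidence=0.7)

print(f"Point estimate: {rate:.2e} per hour")
print(f"Upper bound with zero failures: {bound:.2e} per hour")
```

More operating hours tighten the bound, which is the quantitative sense in which operational history reduces uncertainty.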
3. Real-World Case: Failure Due to Lack of Testing
In a northern Chinese steel plant, a molten steel ladle derailed from the end of a track, causing a catastrophic accident that killed nearly 30 workers during a shift handover. Although a safety limit switch was installed to stop such motion, it had failed, and the failure went undetected because no periodic proof testing was performed.
This event underscores a crucial point:
Safety-related systems are often dormant, and their demand is a low-probability event. If proof testing is neglected, latent faults may remain hidden until a disaster occurs.
4. Conclusion
To effectively control random hardware failures in safety-related systems:
Use components with certified reliability
Design for redundancy and fault tolerance
Increase diagnostic coverage and perform regular proof testing
Analyze field performance and maintenance history
By adopting these strategies within the IEC 61508 lifecycle framework, organizations can enhance the reliability and integrity of their safety functions and meet their target SIL.