Industrial Computing Systems: A Case Study of Fault Tolerance Analysis
Andrey A. Shchurov

TL;DR
This paper analyzes the failure rates of industrial computing systems over their lifespan, focusing on fault tolerance, maintenance scheduling, and extending operational life under financial constraints.
Contribution
It introduces a method to analyze failure rates and optimize maintenance scheduling to improve fault tolerance and system longevity.
Findings
Failure rate increases critically at end-of-life
Maintenance scheduling can mitigate failure risks
Extended fault-tolerant operation is achievable
Abstract
Fault tolerance is a key factor in the design of industrial computing systems. In practice, however, these systems, like every commercial product, operate under tight financial constraints and must remain operational for as long as possible to stay commercially attractive. This work analyzes the instantaneous failure rate of such systems at the end of their lifetime. On the basis of this analysis, we identify the effect of a critical increase in the system failure rate and the basic condition under which it occurs. We then determine a maintenance schedule that helps avoid this effect and extends the system lifetime in fault-tolerant mode.
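The abstract does not state the paper's failure model, so the following is only an illustrative sketch of what an "instantaneous failure rate" looks like for a redundant system near end of life. It assumes, purely for illustration, identical components with Weibull-distributed lifetimes (shape > 1, i.e. wear-out) in active parallel redundancy. The function names and parameter values are hypothetical and are not taken from the paper.

```python
import math

def weibull_reliability(t, scale, shape):
    """Probability that a single component survives to time t."""
    return math.exp(-((t / scale) ** shape))

def weibull_hazard(t, scale, shape):
    """Instantaneous failure rate of a single component at time t."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def parallel_hazard(t, scale, shape, n=2):
    """Instantaneous failure rate of n identical components in active
    parallel redundancy (the system fails only when all n have failed).
    Assumed model for illustration, not the paper's."""
    r = weibull_reliability(t, scale, shape)
    q = 1.0 - r                      # single-component failure probability
    r_sys = 1.0 - q ** n             # system reliability
    f = weibull_hazard(t, scale, shape) * r   # single-component failure density
    f_sys = n * q ** (n - 1) * f     # d/dt of q**n
    return f_sys / r_sys

if __name__ == "__main__":
    scale, shape = 10.0, 2.0         # hypothetical parameters (time units arbitrary)
    for t in (1.0, 5.0, 9.0, 15.0):
        single = weibull_hazard(t, scale, shape)
        system = parallel_hazard(t, scale, shape)
        print(f"t={t:5.1f}  component={single:.4f}  redundant system={system:.4f}")
```

Under these assumptions the redundant system's failure rate starts far below that of a single component but converges toward it as the components age. The protection offered by redundancy therefore erodes late in life, which is the kind of behavior a maintenance schedule would aim to counteract.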
