Failure Analysis and Quantification for Contemporary and Future Supercomputers
Li Tan, Nathan DeBardeleben

TL;DR
This paper presents a comprehensive quantitative analysis of failure modeling in large-scale supercomputers, integrating various failure types and system levels to evaluate resilience and failure rates for current and future systems.
Contribution
It introduces a formal framework for failure modeling across hierarchical levels and assesses resilience strategies' impact on system failure rates.
Findings
Failure rates can be effectively modeled across system hierarchies.
Resilience strategies significantly reduce failure rates under certain scenarios.
Failure-bounded supercomputers demonstrate improved resilience efficiency.
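As a rough illustration of the first two findings, the minimal sketch below aggregates per-component failure rates into a system-level rate. It is not the paper's model: it assumes independent components with constant failure rates arranged in series (any unmasked component failure fails the system), and it treats resilience as a hypothetical `coverage` fraction of failures that are tolerated. The class name, function names, and all numbers are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentClass:
    """One class of components at some level of the system hierarchy (hypothetical)."""
    name: str
    count: int                    # number of identical components
    failure_rate_per_hour: float  # assumed constant rate per component
    coverage: float = 0.0         # assumed fraction of failures masked by resilience


def system_failure_rate(components):
    """Sum the unmasked failure rates, assuming independent series components."""
    return sum(
        c.count * c.failure_rate_per_hour * (1.0 - c.coverage)
        for c in components
    )


def mtbf_hours(components):
    """Mean time between failures is the reciprocal of the aggregate rate."""
    rate = system_failure_rate(components)
    return math.inf if rate == 0 else 1.0 / rate


# Made-up numbers, purely for illustration (not taken from the paper).
baseline = [
    ComponentClass("node", count=1000, failure_rate_per_hour=1e-5),
    ComponentClass("dimm", count=4000, failure_rate_per_hour=1e-6),
]
with_resilience = [
    ComponentClass("node", count=1000, failure_rate_per_hour=1e-5),
    ComponentClass("dimm", count=4000, failure_rate_per_hour=1e-6, coverage=0.5),
]

print(system_failure_rate(baseline), mtbf_hours(baseline))
print(system_failure_rate(with_resilience), mtbf_hours(with_resilience))
```

Under these assumptions, raising the coverage for one failure type lowers the aggregate rate in proportion to that type's share of all failures, which is one way to compare a system against a capped failure-rate threshold.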
Abstract
Large-scale computing systems today are assembled from numerous computing units to provide the massive computational capability needed to solve problems at scale, which makes failures common events in supercomputing scenarios. Considering the demanding resilience requirements of today's supercomputers, we present a quantitative study of fine-grained failure modeling for contemporary and future large-scale computing systems. We integrate various types of failures from different system hierarchical levels and system components, and formally summarize the overall system failure rates. Given that the system-wide failure rate now needs to be capped under a threshold value for reliability and cost-efficiency purposes, we quantitatively discuss different scenarios of system resilience, and analyze the impacts of resilience to different error types on the variation of system failure rates, and the…
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Reliability and Maintenance Optimization
