A Pattern Language for High-Performance Computing Resilience
Saurabh Hukerikar, Christian Engelmann

TL;DR
This paper introduces a structured pattern language to guide the design of resilient high-performance computing systems, addressing the complexity of fault management in large-scale, intricate HPC architectures.
Contribution
It develops a novel pattern language framework that organizes and relates resilience techniques for HPC systems, facilitating systematic and adaptable fault tolerance solutions.
Findings
Pattern language clarifies relationships among resilience techniques.
Enables systematic design of comprehensive HPC resilience solutions.
Supports exploration of alternative fault-handling strategies.
Abstract
High-performance computing (HPC) systems provide powerful capabilities for modeling, simulation, and data analytics for a broad class of computational problems. They enable extreme performance on the order of quadrillions of floating-point arithmetic calculations per second by aggregating the power of millions of compute, memory, networking, and storage components. As HPC systems rapidly grow in scale and complexity to achieve even greater performance, ensuring their reliable operation in the face of system degradations and failures is a critical challenge. System fault events often lead scientific applications to produce incorrect results, or may even cause their untimely termination. The sheer number of components in modern extreme-scale HPC systems and the complex interactions and dependencies among the hardware and software components, the applications, and the physical…
