Pattern-based Modeling of High-Performance Computing Resilience
Saurabh Hukerikar, Christian Engelmann

TL;DR
This paper introduces analytical models for evaluating the reliability and performance of pattern-based resilience solutions in high-performance computing systems, aiding the design of robust and efficient HPC architectures.
Contribution
It develops a unified analytical framework to assess and compare resilience design patterns, facilitating the development of reliable HPC systems.
Findings
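The analytical models let designers evaluate the reliability and performance characteristics of individual resilience design patterns.
The models support a unified framework for analyzing and comparing resilience solutions built from combinations of patterns.
The framework helps designers balance reliability requirements against performance and power overheads.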
Abstract
With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the reliability requirements with the overheads to performance and power. Design patterns enable a structured approach to the development of resilience solutions, providing hardware and software designers with the building block elements for the rapid development of novel solutions and for adapting existing technologies for emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework that analyzes and compares various resilience solutions built using a combination of…
