Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing
Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann

TL;DR
This paper introduces a pattern-based framework for designing multiresilience solutions in high-performance computing, enabling efficient handling of various error types to improve system reliability.
Contribution
It presents a novel pattern-based approach for constructing integrated resilience solutions across system layers for HPC applications.
Findings
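Evaluates the performance and reliability of detection, containment, and mitigation techniques for transient errors that cause silent data corruptions, and of techniques for fail-stop errors that result in process failures.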
Demonstrates a multiresilience design instantiated across multiple system layers.
Shows that the integrated patterns work together to provide resiliency to different error types in a performance-efficient manner.
Abstract
Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that handle multiple error modes. Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment, and mitigation techniques for transient errors that cause silent data corruptions, and of techniques for fail-stop errors that result in process failures. We demonstrate the design and implementation of a multiresilience solution based on patterns instantiated across multiple layers of the system stack. The patterns are integrated to work together to achieve resiliency to different error types in a performance-efficient manner.
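To make the idea of combining patterns for different error modes concrete, here is a minimal Python sketch. It is not the authors' implementation and does not reflect the paper's actual pattern catalog or system layers. It assumes a hypothetical `step` function, a hypothetical `ProcessFailure` exception standing in for a fail-stop error, redundant execution with a CRC32 comparison as the detection technique for silent data corruption, and an in-memory checkpoint with rollback as the mitigation technique. All names are illustrative.

```python
import copy
import zlib


class ProcessFailure(Exception):
    """Hypothetical stand-in for a fail-stop error such as a lost process."""


def checksum(values):
    # Detection helper: CRC32 over the textual form of the state.
    return zlib.crc32(repr(values).encode("utf-8"))


def run_resilient(step, state, n_steps, max_retries=3):
    """Apply `step` n_steps times, committing only verified results."""
    checkpoint = copy.deepcopy(state)  # last known-good state
    for i in range(n_steps):
        for _ in range(max_retries):
            try:
                # Redundant execution: run the step twice on private copies.
                first = step(copy.deepcopy(checkpoint), i)
                second = step(copy.deepcopy(checkpoint), i)
            except ProcessFailure:
                # Fail-stop error: discard partial work, retry from checkpoint.
                continue
            if checksum(first) == checksum(second):
                checkpoint = first  # commit the verified result
                break
            # Mismatch: silent corruption detected. The result is never
            # committed (containment) and the step is re-executed.
        else:
            raise RuntimeError(f"step {i} failed after {max_retries} attempts")
    return checkpoint


def make_flaky_step():
    """Build a step with two injected faults (assumed, for illustration)."""
    calls = {"n": 0}

    def step(values, i):
        calls["n"] += 1
        if calls["n"] == 2:
            raise ProcessFailure("injected fail-stop error")
        result = [v + 1 for v in values]
        if calls["n"] == 5:
            result[0] ^= 1 << 4  # injected bit flip (silent data corruption)
        return result

    return step


final_state = run_resilient(make_flaky_step(), [0, 0, 0], n_steps=3)
print(final_state)
```

In this sketch a single retry loop handles both error modes: a process failure and a checksum mismatch both fall back to the same checkpoint, which is one simple way two patterns can share a recovery mechanism. Comparing two executions cannot catch a corruption that affects both runs identically, so this is only an illustration of the composition idea.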
