Havens: Explicit Reliable Memory Regions for HPC Applications
Saurabh Hukerikar, Christian Engelmann

TL;DR
The paper introduces havens, memory regions protected in software, to improve error resilience on future exascale supercomputers, where memory fault rates are expected to be high.
Contribution
The paper proposes a novel region-based memory management approach with havens that provide fault protection for critical program objects in HPC applications.
Findings
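Havens provide fault protection for program objects through a software-based parity mechanism.
Fault coverage is application agnostic, unlike algorithm-based fault tolerance techniques.
Critical program objects can be placed explicitly in the protected regions.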
Abstract
Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in the system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage the presence of high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects. We provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
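The idea of a parity-protected region can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' implementation or API: the `Haven` class, its method names, the 8-byte word size, and the use of a single XOR parity word per region are all hypothetical. The sketch only detects corruption; it does not attempt recovery.

```python
class Haven:
    """Hypothetical region: a fixed-size pool guarded by one XOR parity word."""

    WORD = 8  # bytes per parity word (an assumption for this sketch)

    def __init__(self, size):
        # Round the pool up to a whole number of words.
        self.size = -(-size // self.WORD) * self.WORD
        self.pool = bytearray(self.size)
        self.top = 0      # bump-allocation offset
        self.parity = 0   # XOR of all words; the pool starts zeroed

    def alloc(self, nbytes):
        """Reserve nbytes inside the region and return the object's offset."""
        if self.top + nbytes > self.size:
            raise MemoryError("haven is full")
        offset = self.top
        self.top += nbytes
        return offset

    def _word(self, index):
        start = index * self.WORD
        return int.from_bytes(self.pool[start:start + self.WORD], "little")

    def write(self, offset, data):
        """Store data and update the parity word incrementally."""
        if offset < 0 or offset + len(data) > self.top:
            raise IndexError("write outside allocated space")
        first = offset // self.WORD
        last = (offset + len(data) - 1) // self.WORD
        for i in range(first, last + 1):
            self.parity ^= self._word(i)   # remove old contribution
        self.pool[offset:offset + len(data)] = data
        for i in range(first, last + 1):
            self.parity ^= self._word(i)   # add new contribution

    def read(self, offset, nbytes):
        return bytes(self.pool[offset:offset + nbytes])

    def verify(self):
        """Recompute parity over the whole pool; False signals corruption."""
        acc = 0
        for i in range(self.size // self.WORD):
            acc ^= self._word(i)
        return acc == self.parity
```

A critical object would be placed with `alloc` and updated only through `write`, so that the stored parity stays consistent; a later call to `verify` returns `False` if a bit in the pool changed without going through `write`. A single parity word keeps the memory overhead small but detects only an odd number of flips in any given bit position, which is one reason a real scheme might choose a different granularity.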
Taxonomy
Topics: Radiation Effects in Electronics · Distributed systems and fault tolerance · Security and Verification in Computing
