Language Support for Reliable Memory Regions
Saurabh Hukerikar, Christian Engelmann

TL;DR
This paper introduces language support for reliable memory regions, called havens, which enables explicit resilience management in HPC systems in response to the rising error rates and architectural heterogeneity expected at exascale.
Contribution
It extends previous haven-based memory management with language annotations, making resilient memory regions more explicit and easier for HPC programmers to use.
Findings
Annotations improve resilience of HPC applications
Implementation demonstrates effective haven management
Enhanced reliability in conjugate gradient solver
Abstract
The path to exascale computational capabilities in high-performance computing (HPC) systems is challenged by the inadequacy of present software technologies to adapt to the rapid evolution of architectures of supercomputing systems. The constraints of power have driven system designs to include increasingly heterogeneous architectures and diverse memory technologies and interfaces. Future systems are also expected to experience an increased rate of errors, such that the applications will no longer be able to assume correct behavior of the underlying machine. To enable the scientific community to succeed in scaling their applications, and to harness the capabilities of exascale systems, we need software strategies that provide mechanisms for explicit management of resilience to errors in the system, in addition to locality of reference in the complex memory hierarchies of future HPC…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Cloud Computing and Resource Management
