Language Support for Reliable Memory Regions

Saurabh Hukerikar; Christian Engelmann

arXiv:1611.02823 · cs.DC · November 24, 2016

TL;DR

This paper introduces language support for reliable memory regions, called havens, enabling explicit resilience management in HPC systems to address increasing errors and heterogeneity at exascale.

Contribution

It extends previous haven-based memory management with language annotations, making resilient memory regions more explicit and easier for HPC programmers to use.

Findings

01 Annotations improve resilience of HPC applications

02 Implementation demonstrates effective haven management

03 Enhanced reliability in conjugate gradient solver

Abstract

The path to exascale computational capabilities in high-performance computing (HPC) systems is challenged by the inadequacy of present software technologies to adapt to the rapid evolution of architectures of supercomputing systems. The constraints of power have driven system designs to include increasingly heterogeneous architectures and diverse memory technologies and interfaces. Future systems are also expected to experience an increased rate of errors, such that the applications will no longer be able to assume correct behavior of the underlying machine. To enable the scientific community to succeed in scaling their applications, and to harness the capabilities of exascale systems, we need software strategies that provide mechanisms for explicit management of resilience to errors in the system, in addition to locality of reference in the complex memory hierarchies of future HPC…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Cloud Computing and Resource Management