CRUM: Checkpoint-Restart Support for CUDA's Unified Memory

Rohan Garg; Apoore Mohan; Michael Sullivan; Gene Cooperman

arXiv:1808.00117 · cs.DC · August 2, 2018

Open Access

TL;DR

CRUM introduces a scalable, low-overhead checkpointing mechanism for CUDA's Unified Virtual Memory, enabling efficient fault tolerance in GPU-accelerated distributed computing environments.

Contribution

The paper presents CRUM, a checkpoint-restart system that supports UVM with low runtime overhead (6% on average) and, with forked checkpointing, checkpoints up to 40 times faster.

Findings

01. CRUM achieves an average overhead of 6%.

02. Forked checkpointing is up to 40 times faster.

03. Supports hybrid CUDA/MPI computations across multiple nodes.
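
The forked checkpointing named in the findings can be illustrated with a minimal sketch, assuming the term refers to the general technique of forking a child process that writes the checkpoint image while the parent keeps computing. This is not CRUM's implementation and it touches only host memory, not GPU or UVM memory; the names `forked_checkpoint`, `demo`, and `ckpt.pkl` are hypothetical, and the code requires a POSIX system where `os.fork` is available.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's PID.

    Hypothetical helper for illustration only, not CRUM's API.
    """
    pid = os.fork()
    if pid == 0:
        # Child: it sees a copy-on-write snapshot of the parent's memory
        # as it was at the moment of the fork.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            # Exit immediately, skipping the parent's cleanup handlers.
            os._exit(status)
    # Parent: return right away so the computation can continue.
    return pid


def demo():
    state = {"step": 0, "data": [0.0] * 1000}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")

    pid = forked_checkpoint(state, path)

    # The parent keeps mutating its state while the child writes the image.
    state["step"] = 1
    state["data"][0] = 42.0

    # Wait for the child before reading the checkpoint back.
    _, wait_status = os.waitpid(pid, 0)
    with open(path, "rb") as f:
        saved = pickle.load(f)
    return wait_status, saved, state
```

In this sketch the parent is blocked only for the duration of the `fork` call rather than the full write to storage, which is the intuition behind overlapping computation with checkpoint storage. The saved image reflects the state at fork time, regardless of what the parent changes afterwards.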

Abstract

Unified Virtual Memory (UVM) was introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications, which directly loaded and unloaded memory segments. Newer CUDA programs have started taking advantage of UVM for the same reason, superior programmability, that led UNIX applications long ago to assume the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart…

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Advanced Data Storage Technologies