CRUM: Checkpoint-Restart Support for CUDA's Unified Memory
Rohan Garg, Apoorve Mohan, Michael Sullivan, Gene Cooperman

TL;DR
CRUM introduces a scalable, low-overhead checkpointing mechanism for CUDA's Unified Virtual Memory, enabling efficient fault tolerance in GPU-accelerated distributed computing environments.
Contribution
The paper presents CRUM, a checkpoint-restart system that supports UVM with low runtime overhead and substantially faster checkpointing than traditional methods.
Findings
CRUM adds an average runtime overhead of 6%.
Forked checkpointing is up to 40 times faster than traditional checkpointing.
CRUM supports hybrid CUDA/MPI computations across multiple nodes.
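The idea behind forked checkpointing can be illustrated with a minimal host-side sketch. This is not CRUM's implementation: it assumes a POSIX system and plain process memory, and it does not touch GPU or UVM state. The function names (`forked_checkpoint`, `wait_for_checkpoint`, `restore`) are hypothetical and chosen only for illustration. The child process created by `fork()` sees a copy-on-write snapshot of the parent's memory, so it can write the checkpoint to disk while the parent resumes computing immediately.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Write `state` to `path` from a forked child; return the child's PID."""
    pid = os.fork()
    if pid == 0:
        # Child: holds a copy-on-write snapshot of memory as of the fork.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename: no partially written file
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup code
    return pid  # Parent: returns immediately and keeps computing.


def wait_for_checkpoint(pid):
    """Block until the checkpoint child exits; True if it succeeded."""
    _, status = os.waitpid(pid, 0)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


def restore(path):
    """Load a previously written checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    state = {"step": 3, "data": [1.0, 2.0, 3.0]}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    pid = forked_checkpoint(state, path)
    # The parent mutates its state while the child is still writing.
    state["step"] = 4
    state["data"].append(4.0)
    assert wait_for_checkpoint(pid)
    print(restore(path))  # the state as it was at fork time
```

In this sketch the application pauses only for the duration of the `fork()` call rather than for the full write to disk, which is the general reason a forked checkpoint can be much faster from the application's point of view.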
Abstract
Unified Virtual Memory (UVM) was introduced on recent NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications, which used to load and unload memory segments directly. Newer CUDA programs have started taking advantage of UVM for the same reason, superior programmability, that led UNIX applications long ago to assume the presence of virtual memory. Therefore, checkpointing of UVM will become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings are GPU-accelerated, with a current trend of ten additional GPU-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart…
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Advanced Data Storage Technologies
