CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

Twinkle Jain; Gene Cooperman

arXiv:2008.10596·cs.DC·August 25, 2020

TL;DR

CRAC is a scalable, low-overhead checkpoint-restart architecture for CUDA applications on NVIDIA GPUs, supporting full CUDA features and enabling fault tolerance in supercomputing environments.

Contribution

It introduces a novel, efficient checkpoint-restart solution for CUDA that supports streams and UVM, reducing overhead and simplifying fault tolerance implementation.

Findings

1. Runtime overhead of approximately 1% or less
2. Fast checkpoint-restart times
3. Support for scalable CUDA streams and UVM (Unified Virtual Memory)

Abstract

The share of the top 500 supercomputers with NVIDIA GPUs is now over 25% and continues to grow. While fault tolerance is a critical issue for supercomputing, there is currently no efficient, scalable solution for CUDA applications on NVIDIA GPUs. CRAC (Checkpoint-Restart Architecture for CUDA) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. CRAC combines: low runtime overhead (approximately 1% or less); fast checkpoint-restart; support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores); and support for the full features of Unified Virtual Memory (eliminating the programmer's burden of migrating memory between device and host). CRAC achieves its flexible architecture by segregating application code (checkpointed) and its external GPU communication via non-reentrant CUDA libraries (not…
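The segregation idea in the abstract can be illustrated with a toy model. This is a minimal sketch, not CRAC's implementation: it assumes the library side is left out of the checkpoint and rebuilt at restart, which the truncated abstract suggests but does not spell out here. `FakeGpuLibrary`, `App`, `checkpoint`, and `restart` are hypothetical names invented for illustration, and no real CUDA call is involved.

```python
import json


class FakeGpuLibrary:
    """Hypothetical stand-in for a GPU runtime library. Never serialized."""

    def __init__(self):
        self.buffers = {}        # handle -> list of floats (plays the role of device memory)
        self._next_handle = 1

    def alloc(self, n):
        handle = self._next_handle
        self._next_handle += 1
        self.buffers[handle] = [0.0] * n
        return handle

    def write(self, handle, data):
        self.buffers[handle] = list(data)

    def read(self, handle):
        return list(self.buffers[handle])


class App:
    """Application side: the only state that goes into the checkpoint."""

    def __init__(self, step=0, handle=None):
        self.step = step
        self.handle = handle

    def run_step(self, lib):
        if self.handle is None:
            self.handle = lib.alloc(4)
        # Stand-in for a kernel launch: add 1.0 to every element.
        lib.write(self.handle, [x + 1.0 for x in lib.read(self.handle)])
        self.step += 1


def checkpoint(app, lib):
    """Copy buffer contents to the host side and save application state only."""
    contents = lib.read(app.handle) if app.handle is not None else None
    return json.dumps({"step": app.step, "contents": contents})


def restart(blob):
    """Build a fresh library instance and re-create the buffer from saved contents."""
    saved = json.loads(blob)
    lib = FakeGpuLibrary()
    app = App(step=saved["step"])
    if saved["contents"] is not None:
        app.handle = lib.alloc(len(saved["contents"]))
        lib.write(app.handle, saved["contents"])
    return app, lib
```

In this toy, the checkpoint image holds only application state and host copies of buffer contents; the library object is discarded and constructed anew on restart, after which the application continues from the saved step.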

Code & Models

Repositories

DMTCP-CRAC/CRAC-early-development (official)
