CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads
Radostin Stoyanov, Viktória Spišaková, Jesus Ramos, Steven Gurfinkel, Andrei Vagin, Adrian Reber, Wesley Armour, Rodrigo Bruno

TL;DR
CRIUgpu introduces a hardware-assisted, transparent checkpointing method for GPU workloads that eliminates steady-state performance overhead and reduces recovery times, facilitating resource sharing and fault tolerance in large-scale GPU environments.
Contribution
The paper presents a novel, driver-based approach to transparent GPU checkpointing that overcomes the limitations of existing API interception techniques.
Findings
Works with CUDA and ROCm applications across multiple GPUs.
Eliminates steady-state performance overheads.
Reduces recovery times significantly.
Abstract
Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared, and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly challenging because of the hardware architecture differences between CPU and GPU, including memory subsystems, dynamic parallelism, and thread synchronization. State-of-the-art GPU checkpointing techniques typically leverage mechanisms that intercept, log, and replay device API calls. However, this approach adds performance overhead and requires hardware-specific implementation that is difficult to test, maintain, and integrate with existing container platforms. In this paper, we present CRIUgpu…
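To make the intercept, log, and replay approach mentioned in the abstract concrete, the following is a minimal toy sketch in Python. It illustrates the general technique that the abstract contrasts CRIUgpu against; it is not CRIUgpu's method and not the API of any real GPU library. The names `FakeDevice`, `LoggingProxy`, and `replay` are hypothetical and exist only for illustration.

```python
class FakeDevice:
    """Hypothetical stand-in for a device API (not a real GPU library)."""

    def __init__(self):
        self.buffers = {}

    def alloc(self, name, size):
        # Allocate a zero-filled buffer under the given name.
        self.buffers[name] = [0] * size

    def write(self, name, index, value):
        # Store a value into an existing buffer.
        self.buffers[name][index] = value


class LoggingProxy:
    """Intercepts every device call and records it for later replay."""

    def __init__(self, device):
        self._device = device
        self.log = []

    def __getattr__(self, method):
        # Only reached for names not found on the proxy itself,
        # i.e. the device API methods being intercepted.
        target = getattr(self._device, method)

        def wrapper(*args):
            # The extra work on every call is the kind of overhead
            # interception adds while the application is running.
            self.log.append((method, args))
            return target(*args)

        return wrapper


def replay(log):
    """Rebuild device state on a fresh device by re-issuing logged calls."""
    device = FakeDevice()
    for method, args in log:
        getattr(device, method)(*args)
    return device


proxy = LoggingProxy(FakeDevice())
proxy.alloc("weights", 4)
proxy.write("weights", 2, 7)

restored = replay(proxy.log)
print(restored.buffers)
```

In this toy model, every call pays the logging cost and restore time grows with the length of the log, which gives a rough intuition for the overheads the abstract attributes to interception-based designs.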
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed Systems and Fault Tolerance
