CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Radostin Stoyanov; Viktória Spišaková; Jesus Ramos; Steven Gurfinkel; Andrei Vagin; Adrian Reber; Wesley Armour; Rodrigo Bruno

arXiv:2502.16631·cs.DC·February 25, 2025

TL;DR

CRIUgpu is a hardware-assisted, transparent checkpointing method for GPU workloads. It avoids steady-state performance overhead and reduces recovery times, which supports resource sharing and fault tolerance in large-scale GPU environments.

Contribution

The paper presents a novel, driver-based approach to transparent GPU checkpointing that overcomes the limitations of existing API interception techniques.

Findings

01 Works with CUDA and ROCm applications across multiple GPUs.

02 Eliminates steady-state performance overheads.

03 Reduces recovery times significantly.

Abstract

Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared, and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly challenging because of the hardware architecture differences between CPU and GPU, including memory subsystems, dynamic parallelism, and thread synchronization. State-of-the-art GPU checkpointing techniques typically leverage mechanisms that intercept, log, and replay device API calls. However, this approach adds performance overhead and requires hardware-specific implementation that is difficult to test, maintain, and integrate with existing container platforms. In this paper, we present CRIUgpu…

Code & Models

Repositories

checkpoint-restore/criu (Official)

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Distributed Systems and Fault Tolerance