PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation
Xingda Wei, Zhuobin Huang, Tianle Sun, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen

TL;DR
PHOENIXOS (PHOS) is an OS service that concurrently checkpoints and restores GPU processes. GPUs lack the dirty bits and copy-on-write paths that CPUs use to trace memory accesses, so PHOS speculates which memory each kernel accesses and validates that speculation at runtime. This supports fault tolerance, process migration, and fast startup.
Contribution
PHOS is the first OS service to checkpoint and restore GPU processes concurrently. It does so with validated speculation: GPU memory accesses are speculated from kernel launch arguments, then validated at runtime using binary instrumentation.
Findings
Orders of magnitude higher performance than NVIDIA cuda-checkpoint.
Effective GPU memory access detection through speculation and validation.
Supports fault tolerance, process migration, and fast startup for GPU workloads.
Abstract
PHOENIXOS (PHOS) is the first OS service that can concurrently checkpoint and restore (C/R) GPU processes, a fundamental capability for critical tasks such as fault tolerance, process migration, and fast startup. While concurrent C/R is well-established on CPUs, it poses unique challenges on GPUs due to their lack of essential features for efficiently tracing concurrent memory reads and writes, such as specific hardware capabilities (e.g., dirty bits) and OS-mediated data paths (e.g., copy-on-write). To ensure correct concurrent C/R, PHOS proactively detects GPU memory reads and writes through a two-step process: first, it speculates about GPU memory accesses based on the arguments used when launching GPU kernels; then, it validates these accesses efficiently at runtime using binary instrumentation. With this validated speculation, PHOS retrofits CPU-based concurrent C/R for GPUs…
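The two-step process described in the abstract can be illustrated with a toy model. The sketch below is not the PHOS implementation. It assumes a hypothetical table of device allocations (`Buffer`), treats any launch argument whose value falls inside an allocation as a pointer into that buffer (`speculate`), and then checks a list of observed addresses against the speculated set (`validate`). In a real system the observed accesses would come from instrumented kernel code; here they are a plain Python list, and every name is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Buffer:
    """Hypothetical record of one device allocation."""
    base: int  # assumed device address where the allocation starts
    size: int  # allocation size in bytes

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


def speculate(args: Iterable[int], allocations: List[Buffer]) -> List[Buffer]:
    """Step 1: guess which buffers a kernel may touch from its launch arguments.

    An argument whose value lies inside a known allocation is treated as a
    pointer into that buffer; other arguments are treated as scalars.
    """
    touched: List[Buffer] = []
    for arg in args:
        for buf in allocations:
            if buf.contains(arg) and buf not in touched:
                touched.append(buf)
    return touched


def validate(accessed: Iterable[int], speculated: List[Buffer]) -> bool:
    """Step 2: check that every observed access lands in a speculated buffer."""
    return all(any(buf.contains(addr) for buf in speculated) for addr in accessed)


# Usage: three allocations, and a kernel launched with two pointers and a scalar.
allocations = [Buffer(0x1000, 256), Buffer(0x2000, 256), Buffer(0x3000, 256)]
speculated = speculate([0x1000, 0x2000, 64], allocations)

ok = validate([0x1000, 0x10FF, 0x2010], speculated)  # accesses stay in bounds
misspeculated = not validate([0x3000], speculated)   # access to an unpredicted buffer
```

In this toy model a failed validation only yields a boolean. How a real system reacts to a misspeculation, for example by falling back to a conservative path, is outside what the truncated abstract states.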
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems
