Fault-Tolerant Hybrid-Parallel Training at Scale with Reliable and Efficient In-memory Checkpointing
Yuxin Wang, Xueze Kang, Shaohuai Shi, Xin He, Zhenheng Tang, Xinglin Pan, Yang Zheng, Xiaoyu Wu, Amelie Chi Zhou, Bingsheng He, Xiaowen Chu

TL;DR
This paper presents a novel distributed in-memory checkpointing system for hybrid-parallel GPU training that significantly reduces overhead and improves failure recovery efficiency, enabling scalable large model training.
Contribution
It introduces hierarchical asynchronous snapshotting and intra-node redundancy techniques to minimize checkpointing overhead and enhance failure resilience in hybrid-parallel training.
Findings
Achieves zero in-memory checkpointing overhead on Frontier for Llama-2-34B training.
Enables fast restart of failed hybrid-parallel training jobs.
Demonstrates scalability and efficiency improvements over existing methods.
Abstract
To efficiently scale large model (LM) training, researchers transition from data parallelism (DP) to hybrid parallelism (HP) on GPU clusters, which frequently experience hardware and software failures. Existing works introduce in-memory checkpointing optimizations that snapshot parameters to device memory for rapid failure recovery. However, these methods introduce severe resource competition between checkpointing and training, which is tolerable under DP but scales poorly under resource-intensive HP. To ensure low checkpointing overhead for hybrid-parallel training, this paper introduces a distributed in-memory checkpointing system with near-zero in-memory saving overhead. It mitigates the on-host resource competition caused by in-memory checkpointing in two ways: (1) It introduces Hierarchical Asynchronous Snapshotting Coordination in the checkpoint saving stage. This…
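As a rough illustration of the general idea behind asynchronous in-memory snapshotting, the sketch below copies a training state in a background thread so the training loop does not block on the copy. It is not the paper's system: it does not model hierarchical coordination, hybrid parallelism, or intra-node redundancy, and the names `AsyncSnapshotter` and `train` are hypothetical. It also assumes each training step rebinds state values instead of mutating them in place.

```python
import copy
import threading


class AsyncSnapshotter:
    """Keeps the latest complete in-memory copy of a training state (toy sketch)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None   # (step, state) of the last completed snapshot
        self._worker = None   # background copy thread, if one is in flight

    def snapshot(self, step, state):
        """Start copying `state` in the background and return immediately."""
        self.wait()  # allow at most one snapshot in flight
        # Capture the current references now. Assumption: later training steps
        # rebind values rather than mutating them in place.
        frozen = dict(state)
        self._worker = threading.Thread(target=self._copy, args=(step, frozen))
        self._worker.start()

    def _copy(self, step, frozen):
        saved = copy.deepcopy(frozen)
        # Publish only after the copy is complete, so a reader never
        # observes a partially written snapshot.
        with self._lock:
            self._latest = (step, saved)

    def wait(self):
        """Block until any in-flight snapshot has finished."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None

    def restore(self):
        """Return (step, state) of the last completed snapshot, or None."""
        with self._lock:
            if self._latest is None:
                return None
            step, saved = self._latest
            return step, copy.deepcopy(saved)


def train(num_steps, snapshot_every, snapshotter):
    """Toy training loop: each step adds 1.0 to every weight."""
    state = {"weights": [0.0, 0.0]}
    for step in range(1, num_steps + 1):
        # Build a new list each step (rebinding, not in-place mutation).
        state["weights"] = [w + 1.0 for w in state["weights"]]
        if step % snapshot_every == 0:
            snapshotter.snapshot(step, state)
    snapshotter.wait()
    return state
```

Calling `train(10, 4, AsyncSnapshotter())` takes snapshots at steps 4 and 8, so a restart after a failure would resume from step 8 and lose only the work done since then. A real system would additionally have to bound the memory and bandwidth the copy consumes, which is the resource competition the abstract describes.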
Taxonomy
Topics: Distributed systems and fault tolerance · Ferroelectric and Negative Capacitance Devices · Cognitive Functions and Memory
