ByteCheckpoint: A Unified Checkpointing System for Large Foundation Model Development
Borui Wan, Mingji Han, Yiyao Sheng, Yanghua Peng, Haibin Lin, Mofan Zhang, Zhichao Lai, Menghan Yu, Junda Zhang, Zuquan Song, Xin Liu, Chuan Wu

TL;DR
ByteCheckpoint is a high-performance, flexible checkpointing system for large-scale training of large foundation models, enabling efficient resharding, multi-framework support, and scalable I/O.
Contribution
The paper introduces a parallelism-agnostic checkpoint representation and a generic checkpointing workflow, which the authors report significantly improve efficiency and flexibility over existing systems.
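To make the idea of a parallelism-agnostic representation concrete, here is a minimal sketch, assuming each saved shard carries metadata giving its position in the global (unsharded) tensor. All names (`ShardMeta`, `save_shards`, `load_range`, `reshard`) are hypothetical illustrations, not ByteCheckpoint's API, and the sketch handles only a flat 1-D tensor held in a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardMeta:
    """Hypothetical metadata: where a shard lives in the global flat tensor."""
    name: str    # fully qualified tensor name
    offset: int  # start index in the global tensor
    length: int  # number of elements in this shard


def _even_split(total, world_size):
    """Yield (offset, length) pairs that split `total` elements across ranks."""
    base, extra = divmod(total, world_size)
    start = 0
    for rank in range(world_size):
        length = base + (1 if rank < extra else 0)
        yield start, length
        start += length


def save_shards(name, global_tensor, world_size):
    """Split a flat tensor across ranks, recording global offsets per shard."""
    return [
        (ShardMeta(name, off, n), global_tensor[off:off + n])
        for off, n in _even_split(len(global_tensor), world_size)
    ]


def load_range(shards, offset, length):
    """Assemble global range [offset, offset + length) from any saved shards."""
    end = offset + length
    out = []
    for meta, data in sorted(shards, key=lambda s: s[0].offset):
        lo = max(offset, meta.offset)
        hi = min(end, meta.offset + meta.length)
        if lo < hi:  # this shard overlaps the requested range
            out.extend(data[lo - meta.offset:hi - meta.offset])
    return out


def reshard(shards, new_world_size):
    """Rebuild shards for a new rank count using only the saved metadata."""
    name = shards[0][0].name
    total = sum(meta.length for meta, _ in shards)
    return [
        (ShardMeta(name, off, n), load_range(shards, off, n))
        for off, n in _even_split(total, new_world_size)
    ]


weights = list(range(10))
saved = save_shards("layer.weight", weights, world_size=4)
resharded = reshard(saved, new_world_size=3)
```

Because each loader only asks for a global index range, the saved layout (4 ranks here) and the loading layout (3 ranks) never need to know about each other. A real system would have to handle multi-dimensional and irregularly sharded tensors, which this sketch does not attempt.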
Findings
Reduces checkpoint runtime stalls by 54.20x on average.
Achieves up to 9.96x faster checkpoint saving.
Achieves up to 8.80x faster checkpoint loading.
Abstract
Checkpointing to preserve training states is crucial during the development of Large Foundation Models (LFMs), for training resumption upon various failures or changes in GPU resources and parallelism configurations. In addition, saved checkpoints are dispatched to evaluation tasks or transferred across different training stages (e.g., from pre-training to post-training). All these scenarios require resharding distributed checkpoints from one parallelism to another. In production environments, different LFMs are trained with various frameworks and storage backends, depending on model sizes and training scales. A high-performance checkpointing system is needed to enable efficient checkpoint management at scale throughout the lifecycle of LFM development. We introduce ByteCheckpoint, an industrial-grade checkpointing system for large-scale LFM training. ByteCheckpoint features: a…
Taxonomy
Topics: Distributed and Parallel Computing Systems · Scientific Computing and Data Management · Business Process Modeling and Analysis
