GoCkpt: Gradient-Assisted Multi-Step overlapped Checkpointing for Efficient LLM Training

Keyao Zhang; Yiquan Chen; Zhuo Hu; Wenhai Lin; Jiexiong Xu; Wenzhi Chen

arXiv:2511.07035 · cs.OS · November 11, 2025

TL;DR

GoCkpt is a checkpointing method that overlaps checkpoint saving with training steps, improving large language model training efficiency by reducing interruptions and maximizing bandwidth utilization.

Contribution

This paper introduces GoCkpt, a new overlapping checkpointing technique that improves training throughput and reduces interruptions in LLM training.

Findings

01. Training throughput increased by up to 38.4%.

02. Training interruption time reduced by 86.7%.

03. Achieved 4.8% overall throughput improvement.

Abstract

The accuracy of large language models (LLMs) improves with increasing model size, but increasing model complexity also poses significant challenges to training stability. Periodic checkpointing is a key mechanism for fault recovery and is widely used in LLM training. However, traditional checkpointing strategies often pause or delay GPU computation while the checkpoint is transferred from GPU to CPU, resulting in significant training interruptions and reduced training throughput. To address this issue, we propose GoCkpt, a method that overlaps checkpoint saving with multiple training steps and restores the final checkpoint on the CPU. The checkpoint is transferred across multiple steps: each step transfers part of the checkpoint state, along with some of the gradient data used for parameter updates. After the transfer is complete, each partial checkpoint state is updated to a…
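
The multi-step transfer described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation. It assumes plain SGD (real LLM training typically uses optimizers with extra state), a toy gradient function, and, because the abstract is truncated at this point, that the goal is to bring every partial state forward to the same final step. All names here (train_with_overlapped_checkpoint, shard_bounds, fake_gradient) are hypothetical. Each step copies one shard of the parameters to a CPU-side buffer and, for shards copied earlier, ships only the matching gradient slices so the CPU copy can replay the update.

```python
import random


def sgd_update(params, grads, lr):
    """Apply one plain SGD step (assumption: the real optimizer may differ)."""
    return [p - lr * g for p, g in zip(params, grads)]


def fake_gradient(params, rng):
    """Toy stand-in for a backward pass; purely illustrative."""
    return [0.1 * p + rng.uniform(-1.0, 1.0) for p in params]


def shard_bounds(n, num_shards):
    """Split range(n) into num_shards contiguous (start, end) slices."""
    base, extra = divmod(n, num_shards)
    bounds, start = [], 0
    for i in range(num_shards):
        end = start + base + (1 if i < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def train_with_overlapped_checkpoint(params, num_shards, lr=0.01, seed=0):
    """Run num_shards training steps while building a checkpoint on the 'CPU'.

    Step k copies shard k after its update. Shards copied at earlier steps
    are stale, so their gradient slices are shipped too and the same update
    is replayed on the CPU copy.
    """
    rng = random.Random(seed)
    bounds = shard_bounds(len(params), num_shards)
    cpu_copy = [None] * len(params)  # stands in for host memory

    for step in range(num_shards):
        grads = fake_gradient(params, rng)

        # Shards already on the CPU: transfer gradient slices and replay.
        for lo, hi in bounds[:step]:
            cpu_copy[lo:hi] = sgd_update(cpu_copy[lo:hi], grads[lo:hi], lr)

        # Regular training update on the 'GPU' parameters.
        params = sgd_update(params, grads, lr)

        # Transfer this step's shard of the freshly updated state.
        lo, hi = bounds[step]
        cpu_copy[lo:hi] = params[lo:hi]

    return params, cpu_copy


if __name__ == "__main__":
    final_params, checkpoint = train_with_overlapped_checkpoint(
        [float(i) for i in range(10)], num_shards=4
    )
    # The CPU checkpoint matches the parameters at the final step.
    print(checkpoint == final_params)
```

In this sketch the per-step transfer is one parameter shard plus gradient slices for the shards already copied, rather than the full state at once, which is the property that lets the copy overlap with ongoing training steps instead of pausing them.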

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Distributed systems and fault tolerance · Advanced Data Storage Technologies