All is Not Lost: LLM Recovery without Checkpoints

Nikolay Blagoev; Oğuzhan Ersoy; Lydia Yiyu Chen

arXiv:2506.15461·cs.DC·April 7, 2026

TL;DR

This paper introduces CheckFree and CheckFree+, two methods for recovering failed stages of a large language model during decentralized training without checkpoints, reducing recovery overhead and improving training efficiency.

Contribution

Proposes checkpoint-free recovery techniques for LLM training that outperform conventional checkpointing and redundant computation in efficiency and scalability.

Findings

01

CheckFree achieves up to 12% faster convergence than redundant computation.

02

CheckFree+ extends recovery to first and last stages using out-of-order pipelining.

03

Neither method requires additional computation or storage, apart from a small overhead for the embedding layers.

Abstract

Training LLMs on decentralized nodes or on-spot instances lowers the training cost and enables model democratization. The inevitable challenge here is the transient churn of nodes due to failures and the operator's scheduling policies, which leads to losing parts of the model (some layers). The conventional approaches to recovering from failures are either checkpointing, where a copy of the entire model is periodically sent to additional storage, or redundant computation. These approaches incur significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper we propose CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However,…
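
The recovery idea stated in the abstract, replacing a failed stage with a weighted average of its closest neighboring stages, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `recover_stage`, the `Stage` type, and the weights `w_prev` and `w_next` are hypothetical, and the abstract does not say how the weights are chosen, so they are passed in by the caller here. The sketch also assumes neighboring stages share the same parameter names and shapes.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened list of weights.
Stage = Dict[str, List[float]]


def recover_stage(prev_stage: Stage, next_stage: Stage,
                  w_prev: float = 1.0, w_next: float = 1.0) -> Stage:
    """Rebuild a lost stage as a weighted average of its two neighbors.

    Assumption: the weights are supplied by the caller; how they are
    derived is not specified in the abstract.
    """
    if prev_stage.keys() != next_stage.keys():
        raise ValueError("neighboring stages must share parameter names")
    total = w_prev + w_next
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    recovered: Stage = {}
    for name in prev_stage:
        a, b = prev_stage[name], next_stage[name]
        if len(a) != len(b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise normalized weighted average of the two neighbors.
        recovered[name] = [(w_prev * x + w_next * y) / total
                           for x, y in zip(a, b)]
    return recovered


if __name__ == "__main__":
    stage_before = {"linear.weight": [0.0, 2.0]}
    stage_after = {"linear.weight": [4.0, 6.0]}
    # The following neighbor gets three times the weight of the preceding one.
    print(recover_stage(stage_before, stage_after, w_prev=1.0, w_next=3.0))
```

Because the replacement is computed only from parameters the neighboring nodes already hold, nothing has to be stored or recomputed ahead of a failure, which is consistent with the abstract's claim of no additional computation or storage.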

Code & Models

Repositories

gensyn-ai/CheckFree
github
