TL;DR
This paper introduces CheckFree and CheckFree+, two methods for recovering lost stages of a large language model during decentralized training without checkpoints or redundant computation, which avoids the communication and computation overhead of those approaches.
Contribution
Proposes checkpoint-free recovery techniques for LLM training that outperform traditional checkpointing and redundancy in efficiency and scalability.
Findings
CheckFree achieves up to 12% faster convergence than redundant computation.
CheckFree+ extends recovery to first and last stages using out-of-order pipelining.
Neither method requires additional computation or storage, apart from a small overhead for the embedding layers.
Abstract
Training LLMs on decentralized nodes or on-spot instances lowers the training cost and enables model democratization. The inevitable challenge here is the transient churn of nodes due to failures and the operator's scheduling policies, which leads to losing parts of the model (some layers). The conventional approaches to recovering from failures are either checkpointing, where a copy of the entire model is periodically sent to additional storage, or redundant computation. These approaches incur significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose CheckFree, an efficient recovery method in which a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However,…
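To make the recovery idea in the abstract concrete, here is a minimal sketch of replacing a lost stage with a weighted average of its two neighbors. It is an illustration under stated assumptions, not the authors' implementation: the function name `recover_stage`, the dictionary layout, and the weights `w_prev` and `w_next` are all hypothetical, and this excerpt does not say how the paper chooses the weights. The sketch also assumes that neighboring stages hold parameters with identical names and sizes, and it does not cover the first or last stage.

```python
from typing import Dict, List

# Hypothetical layout: one stage maps a parameter name to its flattened values.
Stage = Dict[str, List[float]]


def recover_stage(prev_stage: Stage, next_stage: Stage,
                  w_prev: float, w_next: float) -> Stage:
    """Rebuild a lost stage as a weighted average of its two neighbors.

    w_prev and w_next are illustrative weights; how they are chosen
    is an assumption left open here.
    """
    total = w_prev + w_next
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    if prev_stage.keys() != next_stage.keys():
        raise ValueError("neighboring stages must share parameter names")

    recovered: Stage = {}
    for name, prev_vals in prev_stage.items():
        next_vals = next_stage[name]
        if len(prev_vals) != len(next_vals):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # Element-wise weighted average, normalized by the weight sum.
        recovered[name] = [
            (w_prev * a + w_next * b) / total
            for a, b in zip(prev_vals, next_vals)
        ]
    return recovered


# Usage: stage 1 of a three-stage pipeline is lost; rebuild it from stages 0 and 2.
stages = [{"w": [1.0, 2.0]}, None, {"w": [3.0, 6.0]}]
stages[1] = recover_stage(stages[0], stages[2], w_prev=1.0, w_next=3.0)
```

Because the replacement is built only from parameters that the neighboring nodes already hold, nothing has to be written to external storage or computed twice during normal training, which matches the claim that CheckFree needs no additional computation or storage.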
