ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
Ziyue Liu, Zhengyang Wang, Ruijie Zhang, Avinash Maurya, Hui Zhou, Paul Hovland, Sheng Di, Franck Cappello, Bogdan Nicolae, Zheng Zhang

TL;DR
ReCoVer is a fault-tolerant pre-training system for large language models that maintains training consistency and efficiency despite hardware failures across large GPU clusters.
Contribution
ReCoVer introduces a parallelism-agnostic framework, organized as three protocol layers, that keeps LLM pre-training resilient without drifting from the failure-free trajectory.
Findings
Maintains the training trajectory despite 256 GPU failures.
Achieves 2.23x higher effective throughput compared to checkpointing.
Processes 74.9% more tokens at 234 GPU-hours with increased prolongation.
Abstract
Pre-training large language models on massive GPU clusters has made hardware faults routine rather than rare, driving the need for resilient training systems. Yet existing frameworks either focus on specific parallelism schemes or risk drifting away from a failure-free training trajectory. We propose ReCoVer, a resilient LLM pre-training system that upholds a single invariant: each iteration keeps the number of microbatches constant, ensuring per-iteration gradients remain stochastically equivalent to a failure-free run. The framework is organized as three decoupled protocol layers: (1) fault-tolerant collectives that isolate faults and keep them from propagating across replicas; (2) in-step fine-grained recovery that preserves intra-iteration progress and prevents gradient corruption; (3) a versatile-workload policy that dynamically redistributes microbatch quotas across the survivors. The design is…
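To make the invariant concrete, here is a minimal Python sketch of how a fixed per-iteration microbatch count could be redistributed across surviving replicas. It is an illustration only: the function name `redistribute_quotas`, the rank numbering, and the even-split-with-remainder rule are assumptions, not the policy described in the paper, which this summary does not specify.

```python
from typing import Dict, List


def redistribute_quotas(total_microbatches: int, survivors: List[int]) -> Dict[int, int]:
    """Split a fixed microbatch count across surviving replica ranks.

    Hypothetical helper: an even split with the remainder handed to the
    lowest ranks. The actual ReCoVer policy may weigh replicas differently.
    """
    if total_microbatches < 0:
        raise ValueError("total_microbatches must be non-negative")
    if not survivors:
        raise ValueError("no surviving replicas to assign work to")
    ordered = sorted(survivors)
    base, extra = divmod(total_microbatches, len(ordered))
    # The first `extra` ranks take one additional microbatch each,
    # so the quotas always sum to total_microbatches.
    return {rank: base + (1 if i < extra else 0) for i, rank in enumerate(ordered)}


# Eight replicas share 32 microbatches per iteration.
quotas_before = redistribute_quotas(32, [0, 1, 2, 3, 4, 5, 6, 7])

# Replicas 3 and 6 fail; the six survivors absorb their share.
quotas_after = redistribute_quotas(32, [0, 1, 2, 4, 5, 7])
```

In this sketch the per-iteration total stays at 32 before and after the failure; only the per-replica share changes, which is the property the abstract ties to gradients staying stochastically equivalent to a failure-free run.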
