Sparse Checkpointing for Fast and Reliable MoE Training
Swapnil Gandhi, Christos Kozyrakis

TL;DR
MoEvement is a sparse checkpointing system for MoE models that reduces checkpointing overhead by up to 4x and speeds up recovery by up to 31x, making large-scale training more reliable and efficient.
Contribution
The paper presents MoEvement, a novel distributed checkpointing system that leverages sparse snapshots and incremental reconstruction to enhance fault tolerance in MoE training.
Findings
Reduces checkpointing overhead by up to 4x
Speeds up recovery by up to 31x
Maintains a high Effective Training Time Ratio (ETTR) even with frequent failures
Abstract
As large language models scale, training them requires thousands of GPUs over extended durations--making frequent failures an inevitable reality. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Due to their substantially larger training state, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency. We present MoEvement, a distributed, in-memory checkpointing system tailored for MoE models. MoEvement is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of…
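To make idea (1) concrete, here is a minimal sketch of a sparse snapshot schedule. It assumes a simple round-robin policy over expert ids; the abstract does not state how MoEvement actually selects experts, and the names `sparse_schedule` and `run_sparse_snapshots` are hypothetical. Stored values are just iteration numbers standing in for expert state.

```python
from typing import Dict, List


def sparse_schedule(num_experts: int, experts_per_iter: int) -> List[List[int]]:
    """Split expert ids into consecutive windows, one window per iteration.

    ASSUMPTION: round-robin windows are an illustrative policy only,
    not the selection policy used by MoEvement.
    """
    if num_experts <= 0 or experts_per_iter <= 0:
        raise ValueError("num_experts and experts_per_iter must be positive")
    return [
        list(range(start, min(start + experts_per_iter, num_experts)))
        for start in range(0, num_experts, experts_per_iter)
    ]


def run_sparse_snapshots(
    num_experts: int, experts_per_iter: int, num_iters: int
) -> Dict[int, int]:
    """Simulate training; return expert id -> iteration of its latest snapshot."""
    schedule = sparse_schedule(num_experts, experts_per_iter)
    latest: Dict[int, int] = {}
    for iteration in range(num_iters):
        # Only one window of experts is captured per iteration, so the
        # per-iteration snapshot cost is a fraction of a dense checkpoint.
        for expert_id in schedule[iteration % len(schedule)]:
            latest[expert_id] = iteration
    return latest


if __name__ == "__main__":
    print(sparse_schedule(8, 3))
    print(run_sparse_snapshots(8, 3, 3))
```

In this sketch, every expert has been captured after `len(schedule)` iterations, but the captured states come from different iterations. That mismatch is why a sparse scheme needs a step like idea (2), which reconstructs a consistent dense checkpoint from the sparse snapshots.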
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Distributed Sensor Networks and Detection Algorithms · Explainable Artificial Intelligence (XAI)
