Sparse Checkpointing for Fast and Reliable MoE Training

Swapnil Gandhi; Christos Kozyrakis

arXiv:2412.15411 · cs.DC · March 20, 2026

TL;DR

MoEvement is a sparse checkpointing system for MoE models that cuts checkpointing overhead by up to 4x and speeds up recovery by up to 31x, making large-scale training more reliable and efficient.

Contribution

The paper presents MoEvement, a distributed, in-memory checkpointing system that uses sparse snapshots of experts and incremental reconstruction of dense checkpoints to improve fault tolerance in MoE training.

Findings

01 Reduces checkpointing overhead by up to 4x

02 Speeds up recovery by up to 31x

03 Maintains a high Effective Training Time Ratio (ETTR) even with frequent failures

Abstract

As large language models scale, training them requires thousands of GPUs over extended durations, making frequent failures inevitable. While checkpointing remains the primary fault-tolerance mechanism, existing methods fall short when applied to Mixture-of-Experts (MoE) models. Because their training state is substantially larger, MoE models exacerbate checkpointing overheads, often causing costly stalls or prolonged recovery that severely degrade training efficiency. We present MoEvement, a distributed, in-memory checkpointing system tailored for MoE models. MoEvement is built on three key ideas: (1) sparse checkpointing, which incrementally snapshots subsets of experts across iterations to reduce overhead; (2) a sparse-to-dense checkpoint conversion mechanism that incrementally reconstructs consistent dense checkpoints from sparse snapshots; and (3) upstream logging of…
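To make idea (1) concrete, here is a minimal toy sketch in Python of snapshotting a rotating subset of experts at each iteration. It is not MoEvement's implementation: the round-robin policy, the `expert_subset` helper, and the `SparseCheckpointer` class are all hypothetical names chosen for illustration, and expert state is stood in for by plain Python lists.

```python
from copy import deepcopy


def expert_subset(iteration, num_experts, experts_per_snapshot):
    """Pick which experts to snapshot at this iteration.

    Assumption: a simple round-robin policy. The abstract does not
    say how the real system chooses its subsets.
    """
    start = (iteration * experts_per_snapshot) % num_experts
    return [(start + k) % num_experts for k in range(experts_per_snapshot)]


class SparseCheckpointer:
    """Toy in-memory store holding one snapshot per expert (hypothetical)."""

    def __init__(self, num_experts, experts_per_snapshot):
        self.num_experts = num_experts
        self.experts_per_snapshot = experts_per_snapshot
        self.snapshots = {}  # expert id -> (iteration, copied state)

    def snapshot(self, iteration, expert_states):
        """Copy only the chosen subset of experts; return their ids."""
        chosen = expert_subset(
            iteration, self.num_experts, self.experts_per_snapshot
        )
        for e in chosen:
            self.snapshots[e] = (iteration, deepcopy(expert_states[e]))
        return chosen

    def covered(self):
        """True once every expert has at least one snapshot."""
        return len(self.snapshots) == self.num_experts

    def staleness(self, current_iteration):
        """Iterations elapsed since each expert was last snapshotted."""
        return {
            e: current_iteration - it
            for e, (it, _) in self.snapshots.items()
        }


# Usage: 8 experts, 2 snapshotted per iteration.
ckpt = SparseCheckpointer(num_experts=8, experts_per_snapshot=2)
states = {e: [0.0] for e in range(8)}
for it in range(4):
    for e in states:
        states[e][0] += 1.0  # stand-in for a training step
    ckpt.snapshot(it, states)
```

In this toy run each iteration copies a quarter of the experts, and all eight are covered after four iterations. The snapshots were taken at different iterations, though, so on their own they do not form a consistent dense checkpoint; this is presumably the gap that the sparse-to-dense conversion in idea (2) addresses.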

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Distributed Sensor Networks and Detection Algorithms · Explainable Artificial Intelligence (XAI)