MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training
Weilin Cai, Le Qin, Jiayi Huang

TL;DR
The paper introduces MoC-System, a fault-tolerance system for training sparse Mixture-of-Experts (MoE) models that cuts checkpointing overhead by up to 98.9% while keeping model accuracy comparable.
Contribution
It proposes Partial Experts Checkpointing (PEC), which saves only a selected subset of experts in each checkpoint, together with hybrid parallel strategies for managing large MoE model checkpoints, reducing checkpointing overhead by up to 98.9%.
Findings
Up to 98.9% reduction in checkpointing overhead.
Maintains comparable or improved model accuracy.
Effective in large-scale distributed training environments.
Abstract
As large language models continue to scale up, distributed training systems have expanded beyond 10k nodes, intensifying the importance of fault tolerance. Checkpoint has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of the sparse Mixture-of-Experts (MoE) model presents new challenges due to the substantial increase in model size, despite comparable computational demands to dense models. In this work, we propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems. MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts, effectively reducing the MoE checkpoint size to levels comparable with dense models.…
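The idea behind PEC, saving all non-expert parameters but only a subset of experts in each checkpoint, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the round-robin selection policy, the `expert.<id>.<name>` key naming, and the helper names `select_experts`, `partial_checkpoint`, and `size_fraction` are all assumptions made for illustration. The abstract only states that a selected subset of experts is saved.

```python
from typing import Dict, List


def select_experts(ckpt_index: int, num_experts: int, k: int) -> List[int]:
    """Pick k expert ids for the ckpt_index-th checkpoint.

    Assumed policy: rotate through experts round-robin so that every
    expert is saved once every ceil(num_experts / k) checkpoints.
    """
    start = (ckpt_index * k) % num_experts
    return [(start + i) % num_experts for i in range(k)]


def partial_checkpoint(
    state: Dict[str, object], ckpt_index: int, num_experts: int, k: int
) -> Dict[str, object]:
    """Return the subset of `state` that a partial checkpoint would save.

    Assumes expert parameters are keyed as 'expert.<id>.<name>'
    (hypothetical naming); every other key is treated as non-expert
    state and is always kept.
    """
    chosen = set(select_experts(ckpt_index, num_experts, k))
    saved = {}
    for name, value in state.items():
        parts = name.split(".")
        if parts[0] == "expert":
            if int(parts[1]) in chosen:
                saved[name] = value
        else:
            saved[name] = value
    return saved


def size_fraction(
    expert_params: float, other_params: float, num_experts: int, k: int
) -> float:
    """Fraction of the full checkpoint size written when saving k of
    num_experts equally sized experts plus all non-expert parameters."""
    full = expert_params + other_params
    partial = other_params + expert_params * k / num_experts
    return partial / full


if __name__ == "__main__":
    state = {"attn.w": 1, "expert.0.w": 2, "expert.1.w": 3,
             "expert.2.w": 4, "expert.3.w": 5}
    print(partial_checkpoint(state, ckpt_index=1, num_experts=4, k=1))
    print(size_fraction(expert_params=90, other_params=10, num_experts=8, k=1))
```

In this toy setting, if experts hold 90% of the parameters and one of eight experts is saved per checkpoint, each checkpoint writes roughly a fifth of the full model state. Under such a scheme, an expert absent from the latest checkpoint would have to be restored from an earlier one.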
Taxonomy
Topics: Distributed Sensor Networks and Detection Algorithms · Advanced Statistical Process Monitoring · Mobile Crowdsensing and Crowdsourcing
Methods: Mixture of Experts
