MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training

Weilin Cai; Le Qin; Jiayi Huang

arXiv:2408.04307·cs.DC·April 10, 2025

TL;DR

The paper introduces MoC-System, a novel fault tolerance approach for sparse Mixture-of-Experts models that significantly reduces checkpointing overhead while maintaining model accuracy.

Contribution

It proposes Partial Experts Checkpointing and hybrid parallel strategies to efficiently manage large MoE model checkpoints, reducing overhead by up to 98.9%.

Findings

01 Up to 98.9% reduction in checkpointing overhead.

02 Maintains comparable or improved model accuracy.

03 Effective in large-scale distributed training environments.

Abstract

As large language models continue to scale up, distributed training systems have expanded beyond 10k nodes, making fault tolerance increasingly important. Checkpointing has emerged as the predominant fault tolerance strategy, with extensive studies dedicated to optimizing its efficiency. However, the advent of sparse Mixture-of-Experts (MoE) models presents new challenges: model size increases substantially even though computational demands remain comparable to those of dense models. In this work, we propose the Mixture-of-Checkpoint System (MoC-System) to orchestrate the vast array of checkpoint shards produced in distributed training systems. MoC-System features a novel Partial Experts Checkpointing (PEC) mechanism, an algorithm-system co-design that strategically saves a selected subset of experts, effectively reducing the MoE checkpoint size to levels comparable with dense models.…
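To make the idea of saving only a subset of experts concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the round-robin selection rule, the function names `select_experts` and `partial_checkpoint`, and the dictionary layout of the checkpoint are all assumptions chosen for illustration. The abstract says only that a selected subset of experts is saved at each checkpoint.

```python
from typing import Any, Dict, List


def select_experts(step: int, num_experts: int, k: int) -> List[int]:
    """Pick k expert indices for checkpoint number `step`.

    Assumption: a simple round-robin rotation, so that every expert is
    saved once every ceil(num_experts / k) checkpoints.
    """
    if not 0 < k <= num_experts:
        raise ValueError("k must be between 1 and num_experts")
    start = (step * k) % num_experts
    return [(start + i) % num_experts for i in range(k)]


def partial_checkpoint(
    non_expert_state: Dict[str, Any],
    expert_states: Dict[int, Any],
    step: int,
    k: int,
) -> Dict[str, Any]:
    """Build a checkpoint holding all non-expert state but only k experts."""
    chosen = select_experts(step, len(expert_states), k)
    return {
        "step": step,
        # Non-expert parameters (hypothetical layout) are always saved in full.
        "non_expert": dict(non_expert_state),
        # Only the selected experts are written at this checkpoint.
        "experts": {i: expert_states[i] for i in chosen},
    }


# Toy usage: 8 experts, 2 saved per checkpoint.
experts = {i: [float(i)] * 4 for i in range(8)}
non_expert = {"attention": [0.0] * 4}
ckpt = partial_checkpoint(non_expert, experts, step=1, k=2)
```

In this toy setup each checkpoint carries 2 of 8 experts, so the expert portion is a quarter of what a full checkpoint would hold. The trade-off is that, after a failure, experts not present in the latest checkpoint would have to be restored from an older one.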

Taxonomy

Topics: Distributed Sensor Networks and Detection Algorithms · Advanced Statistical Process Monitoring · Mobile Crowdsensing and Crowdsourcing

Methods: Mixture of Experts