MegaScale-MoE: Large-Scale Communication-Efficient Training of Mixture-of-Experts Models in Production

Chao Jin; Ziheng Jiang; Zhihao Bai; Zheng Zhong; Juncai Liu; Xiang Li; Ningxin Zheng; Xi Wang; Cong Xie; Qi Huang; Wen Heng; Yiyuan Ma; Wenlei Bao; Size Zheng; Yanghua Peng; Haibin Lin; Xuanzhe Liu; Xin Jin; Xin Liu

arXiv:2505.11432 · cs.LG · October 20, 2025

TL;DR

MegaScale-MoE is a communication-efficient production system for training large MoE models. It reaches 1.41M tokens/s on a 352B model, a 1.88× efficiency improvement over Megatron-LM.

Contribution

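The paper presents a production system design for efficient large-scale MoE training. It combines communication-efficient parallelism strategies customized for the attention and FFN parts of each MoE layer, overlap of communication with computation at both inter- and intra-operator levels, and communication compression to lower precision.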

Findings

01. Achieved a training throughput of 1.41M tokens/s on a 352B-parameter MoE model

02. Improved training efficiency by 1.88× over Megatron-LM

03. Demonstrated effective communication compression and parallelism strategies

Abstract

We present MegaScale-MoE, a production system tailored for the efficient training of large-scale mixture-of-experts (MoE) models. MoE emerges as a promising architecture to scale large language models (LLMs) to unprecedented sizes, thereby enhancing model performance. However, existing MoE training systems experience a degradation in training efficiency, exacerbated by the escalating scale of MoE models and the continuous evolution of hardware. Recognizing the pivotal role of efficient communication in enhancing MoE training, MegaScale-MoE customizes communication-efficient parallelism strategies for attention and FFNs in each MoE layer and adopts a holistic approach to overlap communication with computation at both inter- and intra-operator levels. Additionally, MegaScale-MoE applies communication compression with adjusted communication patterns to lower precision, further improving…
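The abstract's idea of compressing communication to lower precision can be illustrated with a toy reduction. This is only a sketch under stated assumptions, not the paper's implementation: the abstract does not say which number format or collective is used, so FP16 payloads and a simple sum across ranks are assumed here, and the function names (`compress_fp16`, `decompress_fp16`, `reduce_with_compression`) are hypothetical. Each simulated rank sends its values as half precision, and the receiver accumulates in full precision.

```python
import struct


def compress_fp16(values):
    """Pack floats as IEEE 754 half precision (2 bytes per value)."""
    return struct.pack(f"<{len(values)}e", *values)


def decompress_fp16(payload):
    """Unpack a half-precision payload back into Python floats."""
    count = len(payload) // 2
    return list(struct.unpack(f"<{count}e", payload))


def reduce_with_compression(per_rank_values):
    """Sum values from all simulated ranks.

    Assumption for illustration: every rank's contribution travels as FP16,
    while the receiver accumulates in full precision.
    Returns the reduced vector and the total number of bytes "sent".
    """
    total = [0.0] * len(per_rank_values[0])
    bytes_sent = 0
    for values in per_rank_values:
        payload = compress_fp16(values)
        bytes_sent += len(payload)
        for i, v in enumerate(decompress_fp16(payload)):
            total[i] += v
    return total, bytes_sent


# Two simulated ranks, four values each (made-up numbers).
per_rank = [[0.5, -1.25, 3.0, 0.1], [1.5, 0.25, -2.0, 0.2]]
reduced, sent = reduce_with_compression(per_rank)

# Baseline traffic if the same values were sent as 4-byte FP32.
fp32_bytes = sum(len(struct.pack(f"<{len(v)}f", *v)) for v in per_rank)

print("reduced:", reduced)
print("bytes sent (FP16):", sent, "vs FP32:", fp32_bytes)
```

In this toy setting the FP16 payload is half the size of the FP32 one, at the cost of a small rounding error on values such as 0.1 that half precision cannot represent exactly.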

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Machine Learning and Algorithms

Methods: Softmax · Attention Is All You Need · Mixture of Experts