X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms
Yueming Yuan, Ahan Gupta, Jianping Li, Sajal Dash, Feiyi Wang, Minjia Zhang

TL;DR
X-MoE is a scalable training system for Mixture-of-Experts (MoE) architectures that overcomes memory and communication bottlenecks, enabling training of models with up to 545 billion parameters on HPC platforms.
Contribution
The paper presents X-MoE, a novel training system with cross-platform kernels and hybrid parallelism, significantly improving scalability of MoE models on non-NVIDIA HPC hardware.
Findings
Scales MoE models up to 545 billion parameters on 1024 GPUs.
Trains models 10x larger than previous methods allow within the same hardware budget.
Maintains high training throughput on AMD-based supercomputers.
Abstract
Emerging expert-specialized Mixture-of-Experts (MoE) architectures, such as DeepSeek-MoE, deliver strong model quality through fine-grained expert segmentation and large top-k routing. However, their scalability is limited by substantial activation memory overhead and costly all-to-all communication. Furthermore, current MoE training systems, primarily optimized for NVIDIA GPUs, perform suboptimally on non-NVIDIA platforms, leaving significant computational potential untapped. In this work, we present X-MoE, a novel MoE training system designed to deliver scalable training performance for next-generation MoE architectures. X-MoE achieves this via several novel techniques, including efficient padding-free MoE training with cross-platform kernels, redundancy-bypassing dispatch, and hybrid parallelism with sequence-sharded MoE blocks. Our evaluation on the Frontier supercomputer, powered…
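To make the "padding-free" idea in the abstract concrete, the following is a minimal pure-Python sketch, not X-MoE's actual kernels. It contrasts a conventional fixed-capacity dispatch buffer (which pads unused slots and drops overflow tokens) with a compact layout that stores only the routed token ids plus per-expert offsets. The function names, the toy scores, and the capacity values are all assumptions made for illustration.

```python
def top_k_route(scores, k):
    """For each token, return the indices of its k highest-scoring experts."""
    routes = []
    for token_scores in scores:
        ranked = sorted(range(len(token_scores)),
                        key=lambda e: token_scores[e], reverse=True)
        routes.append(ranked[:k])
    return routes


def padded_dispatch(routes, num_experts, capacity):
    """Fixed-capacity layout: every expert owns `capacity` slots.

    Unused slots stay as None (padding); tokens beyond capacity are dropped.
    """
    buffers = [[None] * capacity for _ in range(num_experts)]
    fill = [0] * num_experts
    dropped = 0
    for token, experts in enumerate(routes):
        for e in experts:
            if fill[e] < capacity:
                buffers[e][fill[e]] = token
                fill[e] += 1
            else:
                dropped += 1
    return buffers, dropped


def padding_free_dispatch(routes, num_experts):
    """Compact layout: token ids grouped by expert, plus prefix offsets.

    Expert e owns flat[offsets[e]:offsets[e + 1]]; no slot is wasted.
    """
    per_expert = [[] for _ in range(num_experts)]
    for token, experts in enumerate(routes):
        for e in experts:
            per_expert[e].append(token)
    flat, offsets = [], [0]
    for tokens in per_expert:
        flat.extend(tokens)
        offsets.append(len(flat))
    return flat, offsets


# Toy example (hypothetical numbers): 4 tokens, 4 experts, top-2 routing.
scores = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.4, 0.0],
    [0.6, 0.3, 0.2, 0.1],
]
routes = top_k_route(scores, k=2)

buffers, dropped = padded_dispatch(routes, num_experts=4, capacity=4)
padding_slots = sum(slot is None for buf in buffers for slot in buf)

flat, offsets = padding_free_dispatch(routes, num_experts=4)

print("padded slots allocated:", 4 * 4, "of which padding:", padding_slots)
print("padding-free slots allocated:", len(flat), "offsets:", offsets)
```

In this toy setting the routing is skewed toward expert 0, so the padded layout either wastes slots (large capacity) or drops tokens (small capacity), while the compact layout always holds exactly `tokens * k` entries. Real systems apply the same idea to activation tensors on the GPU, where the wasted slots translate into activation memory and all-to-all traffic.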
