Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs
Yehui Tang, Yichun Yin, Yaoyuan Wang, Hang Zhou, Yu Pan, Wei Guo, Ziyang Zhang, Miao Rang, Fangcheng Liu, Naifu Zhang, Binghan Li, Yonghan Dong, Xiaojun Meng, Yasheng Wang, Dong Li, Yin Li, Dandan Tu, Can Chen, Youliang Yan, Fisher Yu, Ruiming Tang, Yunhe Wang, Botian Huang

TL;DR
This paper presents a comprehensive approach to efficiently train a 718-billion-parameter sparse language model on Ascend NPUs, optimizing hardware utilization and demonstrating scalable training of large MoE models.
Contribution
The paper introduces Pangu Ultra MoE, a large-scale sparse language model trained efficiently on Ascend NPUs, with optimized system strategies and simulation-based hyperparameter selection.
Findings
Achieved 30.0% Model FLOPs Utilization (MFU) when training on 6K Ascend NPUs.
Optimized Expert Parallelism reduces communication overhead.
Demonstrated scalable training of a 718-billion-parameter MoE model.
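For readers unfamiliar with the MFU figure quoted above, the following is a minimal sketch of how such a number is commonly estimated. It is not the paper's measurement procedure. It assumes the widely used approximation of about 6 FLOPs per active parameter per trained token (forward plus backward pass, ignoring attention FLOPs), and every input value in the example call is a made-up placeholder chosen for round arithmetic, not a value taken from the paper.

```python
def model_flops_utilization(tokens_per_second, active_params,
                            num_devices, peak_flops_per_device):
    """Estimate MFU: achieved model FLOP/s over aggregate peak FLOP/s.

    Assumption: training costs about 6 FLOPs per active parameter per
    token (forward + backward), with attention FLOPs ignored. For a
    sparse MoE model, active_params counts only the parameters used
    per token, not the total parameter count.
    """
    if num_devices <= 0 or peak_flops_per_device <= 0:
        raise ValueError("num_devices and peak_flops_per_device must be positive")
    achieved_flops_per_second = 6.0 * active_params * tokens_per_second
    peak_flops_per_second = num_devices * peak_flops_per_device
    return achieved_flops_per_second / peak_flops_per_second


# Hypothetical inputs, for illustration only (not from the paper).
example_mfu = model_flops_utilization(
    tokens_per_second=1.0e6,        # cluster-wide training throughput
    active_params=4.0e10,           # parameters activated per token
    num_devices=1000,               # number of accelerators
    peak_flops_per_device=8.0e14,   # peak FLOP/s of one accelerator
)
print(f"Estimated MFU: {example_mfu:.1%}")
```

Using the active rather than the total parameter count matters for MoE models: only a few experts run for each token, so counting all parameters would overstate the useful compute.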
Abstract
Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into…
Taxonomy
Topics: Topic Modeling · Big Data and Digital Economy · Multimodal Machine Learning Applications
Methods: Mixture of Experts
