Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs

Yehui Tang; Yichun Yin; Yaoyuan Wang; Hang Zhou; Yu Pan; Wei Guo; Ziyang Zhang; Miao Rang; Fangcheng Liu; Naifu Zhang; Binghan Li; Yonghan Dong; Xiaojun Meng; Yasheng Wang; Dong Li; Yin Li; Dandan Tu; Can Chen; Youliang Yan; Fisher Yu; Ruiming Tang; Yunhe Wang; Botian Huang; Bo Wang; Boxiao Liu; Changzheng Zhang; Da Kuang; Fei Liu; Gang Huang; Jiansheng Wei; Jiarui Qin; Jie Ran; Jinpeng Li; Jun Zhao; Liang Dai; Lin Li; Liqun Deng; Peifeng Qin; Pengyuan Zeng; Qiang Gu; Shaohua Tang; Shengjun Cheng; Tao Gao; Tao Yu; Tianshu Li; Tianyu Bi; Wei He; Weikai Mao; Wenyong Huang; Wulong Liu; Xiabing Li; Xianzhi Yu; Xueyu Wu; Xu He; Yangkai Du; Yan Xu; Ye Tian; Yimeng Wu; Yongbing Huang; Yong Tian; Yong Zhu; Yue Li; Yufei Wang; Yuhang Gai; Yujun Li; Yu Luo; Yunsheng Ni; Yusen Sun; Zelin Chen; Zhe Liu; Zhicheng Liu; Zhipeng Tu; Zilin Ding; Zongyuan Zhan

arXiv:2505.04519 · cs.CL · May 8, 2025

Open Access

TL;DR

This paper presents a comprehensive approach to efficiently train a 718-billion-parameter sparse language model on Ascend NPUs, optimizing hardware utilization and demonstrating scalable training of large MoE models.

Contribution

The paper introduces Pangu Ultra MoE, a large-scale sparse language model trained efficiently on Ascend NPUs, with optimized system strategies and simulation-based hyperparameter selection.

Findings

01 Achieved 30.0% Model FLOPs Utilization (MFU) when training on 6K Ascend NPUs.

02 Optimized Expert Parallelism reduces communication overhead.

03 Demonstrated scalable training of a 718-billion-parameter MoE model.
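
For readers unfamiliar with the MFU figure quoted above, the sketch below shows one common way such a number is estimated: achieved training FLOPs per second divided by the aggregate peak FLOPs of the devices. It assumes the widely used approximation of about 6 FLOPs per activated parameter per token, which ignores attention cost; the paper may account for FLOPs differently. The function name and every input value are hypothetical placeholders chosen for round arithmetic and are not taken from the paper.

```python
def model_flops_utilization(tokens_per_second, active_params,
                            num_devices, peak_flops_per_device):
    """Estimate MFU as achieved training FLOP/s over aggregate peak FLOP/s.

    Assumes roughly 6 FLOPs per activated parameter per token (forward
    plus backward pass), a common approximation that omits attention FLOPs.
    For a sparse MoE model, only the parameters activated per token count.
    """
    flops_per_token = 6.0 * active_params
    achieved_flops_per_second = tokens_per_second * flops_per_token
    peak_flops_per_second = num_devices * peak_flops_per_device
    return achieved_flops_per_second / peak_flops_per_second


# All inputs below are made-up illustrative values, not measurements.
mfu = model_flops_utilization(
    tokens_per_second=1.0e6,      # hypothetical cluster-wide throughput
    active_params=4.0e10,         # hypothetical activated parameters per token
    num_devices=1000,             # hypothetical device count
    peak_flops_per_device=1.0e15, # hypothetical peak FLOP/s per device
)
print(f"MFU = {mfu:.1%}")
```

With these placeholder inputs the estimate is 24%; plugging in measured throughput and the hardware's rated peak would give the corresponding figure for a real run.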

Abstract

Sparse large language models (LLMs) with Mixture of Experts (MoE) and close to a trillion parameters are dominating the realm of most capable language models. However, the massive model scale poses significant challenges for the underlying software and hardware systems. In this paper, we aim to uncover a recipe to harness such scale on Ascend NPUs. The key goals are better usage of the computing resources under the dynamic sparse model structures and materializing the expected performance gain on the actual hardware. To select model configurations suitable for Ascend NPUs without repeatedly running the expensive experiments, we leverage simulation to compare the trade-off of various model hyperparameters. This study led to Pangu Ultra MoE, a sparse LLM with 718 billion parameters, and we conducted experiments on the model to verify the simulation results. On the system side, we dig into…

Taxonomy

Topics: Topic Modeling · Big Data and Digital Economy · Multimodal Machine Learning Applications

Methods: Mixture of Experts