LAER-MoE: Load-Adaptive Expert Re-layout for Efficient Mixture-of-Experts Training
Xinyi Liu, Yujie Wang, Fangcheng Fu, Xuefeng Xiao, Huixia Li, Jiashi Li, Bin Cui

TL;DR
LAER-MoE is a load-adaptive expert re-layout framework for MoE training. It improves load balancing and training efficiency by combining a new parallel paradigm, Fully Sharded Expert Parallel (FSEP), with fine-grained communication scheduling.
Contribution
The paper proposes Fully Sharded Expert Parallel (FSEP), which enables flexible expert re-layout during training to address load imbalance in MoE models.
Findings
Achieves up to 1.69x acceleration over state-of-the-art systems.
Effectively balances expert load during training.
Reduces communication overhead with fine-grained scheduling.
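To make the load-balancing finding concrete, here is a minimal sketch of how expert load imbalance can be measured for a given expert-to-device layout. The token counts, the `static_layout`, and the `greedy_relayout` heuristic are all hypothetical and chosen only for illustration; they are not the paper's algorithm or data.

```python
# Hypothetical token counts routed to 8 experts in one iteration (made-up numbers).
tokens_per_expert = [900, 120, 100, 80, 300, 250, 150, 100]


def device_loads(layout, tokens):
    """layout[d] lists the expert ids hosted on device d; returns tokens per device."""
    return [sum(tokens[e] for e in experts) for experts in layout]


def imbalance(loads):
    """Max device load divided by mean device load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


def greedy_relayout(tokens, num_devices):
    """Illustrative heuristic only (not from the paper): place the heaviest
    experts first, each on the currently least-loaded device."""
    layout = [[] for _ in range(num_devices)]
    loads = [0] * num_devices
    for e in sorted(range(len(tokens)), key=lambda i: -tokens[i]):
        d = loads.index(min(loads))
        layout[d].append(e)
        loads[d] += tokens[e]
    return layout


# Static expert-parallel layout: experts assigned to 4 devices in index order.
static_layout = [[0, 1], [2, 3], [4, 5], [6, 7]]
static_loads = device_loads(static_layout, tokens_per_expert)

new_layout = greedy_relayout(tokens_per_expert, num_devices=4)
new_loads = device_loads(new_layout, tokens_per_expert)

print("static :", static_loads, round(imbalance(static_loads), 2))
print("relayout:", new_loads, round(imbalance(new_loads), 2))
```

In this toy setting, every device waits for the most heavily loaded one, so the max-to-mean ratio is a rough proxy for wasted time per iteration. Moving whole experts helps only partially here, because a single hot expert still dominates one device.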
Abstract
Expert parallelism is vital for effectively training Mixture-of-Experts (MoE) models, enabling different devices to host distinct experts, with each device processing different input data. However, during expert parallel training, dynamic routing results in significant load imbalance among experts: a handful of overloaded experts hinder overall iteration, emerging as a training bottleneck. In this paper, we introduce LAER-MoE, an efficient MoE training framework. The core of LAER-MoE is a novel parallel paradigm, Fully Sharded Expert Parallel (FSEP), which fully partitions each expert parameter by the number of devices and restores partial experts at expert granularity through All-to-All communication during training. This allows for flexible re-layout of expert parameters during training to enhance load balancing. In particular, we perform fine-grained scheduling of communication…
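The sharding idea described in the abstract can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the parameter values, the `shard` and `restore` helpers, and both layouts are hypothetical, and the list concatenation in `restore` stands in for the collective communication a real system would use.

```python
def shard(params, num_devices):
    """Split a flat parameter list into num_devices equal contiguous shards."""
    assert len(params) % num_devices == 0
    size = len(params) // num_devices
    return [params[i * size:(i + 1) * size] for i in range(num_devices)]


num_devices = 4

# Hypothetical experts: 4 experts, each a flat list of 8 parameters.
experts = {e: [e * 100 + i for i in range(8)] for e in range(4)}

# device_shards[d][e] is the slice of expert e that device d stores.
# Every device holds 1/num_devices of every expert.
device_shards = [
    {e: shard(p, num_devices)[d] for e, p in experts.items()}
    for d in range(num_devices)
]


def restore(expert_id, shards_by_device):
    """Rebuild one full expert by concatenating its slice from every device.
    In a real system this would be a collective; here it is a plain loop."""
    full = []
    for shards in shards_by_device:
        full.extend(shards[expert_id])
    return full


# Two arbitrary layouts (device -> experts to materialize). Switching between
# them needs no change to the stored shards, only to what gets restored where.
layout_a = {0: [0], 1: [1], 2: [2], 3: [3]}
layout_b = {0: [3], 1: [0], 2: [1], 3: [2]}

for layout in (layout_a, layout_b):
    materialized = {
        d: {e: restore(e, device_shards) for e in ids}
        for d, ids in layout.items()
    }
    print({d: list(m) for d, m in materialized.items()})
```

Because each device stores a slice of every expert, any device can rebuild any expert on demand, which is what makes the expert-to-device assignment adjustable from one iteration to the next in this sketch.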
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Privacy-Preserving Technologies in Data
