Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models
Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu, Jingren Zhou, Junyang Lin

TL;DR
This paper proposes a global-batch load balancing loss for training Mixture-of-Experts models, which improves expert specialization and overall performance by leveraging diverse data across larger batches.
Contribution
It introduces a global-batch based load balancing loss calculation with an extra communication step, enhancing expert specialization and model performance in large-scale MoE training.
Findings
Global-batch LBL improves pre-training perplexity.
Global-batch LBL enhances domain specialization of experts.
Method scales to models up to 42.8B parameters.
Abstract
This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoEs) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of the expert $i$. Existing MoE training frameworks usually employ the parallel training strategy so that $f_i$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences. So, the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute the token evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence…
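To make the difference between the two granularities concrete, here is a minimal single-process sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names (`routing_stats`, `lbl`, `micro_batch_lbl`, `global_batch_lbl`) are hypothetical, routing is top-1, and the cross-device communication step is simulated by a token-weighted average of the per-micro-batch statistics. For simplicity both the selection frequency and the mean gating score are pooled in the global variant.

```python
def routing_stats(expert_ids, gate_probs, num_experts):
    """Per-expert selection frequency f and mean gating score p for one batch.

    expert_ids: chosen expert index for each token (top-1 routing, an assumption).
    gate_probs: per-token list of gating probabilities over all experts.
    """
    n = len(expert_ids)
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for chosen, probs in zip(expert_ids, gate_probs):
        f[chosen] += 1.0 / n
        for i in range(num_experts):
            p[i] += probs[i] / n
    return f, p


def lbl(f, p):
    """Load-balancing loss: N_E * sum_i f_i * p_i."""
    return len(f) * sum(fi * pi for fi, pi in zip(f, p))


def micro_batch_lbl(micro_batches, num_experts):
    """Compute the LBL inside each micro-batch, then average the losses."""
    losses = [lbl(*routing_stats(ids, probs, num_experts))
              for ids, probs in micro_batches]
    return sum(losses) / len(losses)


def global_batch_lbl(micro_batches, num_experts):
    """Pool f and p over all micro-batches first, then compute one LBL.

    The token-weighted average stands in for the extra communication step
    that a distributed setup would need (simulated here in one process).
    """
    total = sum(len(ids) for ids, _ in micro_batches)
    f_global = [0.0] * num_experts
    p_global = [0.0] * num_experts
    for ids, probs in micro_batches:
        f, p = routing_stats(ids, probs, num_experts)
        weight = len(ids) / total
        for i in range(num_experts):
            f_global[i] += weight * f[i]
            p_global[i] += weight * p[i]
    return lbl(f_global, p_global)


# Toy data: two experts, two domain-specific micro-batches of four tokens each.
# Every token in the first batch prefers expert 0, every token in the second
# prefers expert 1, so the experts are fully specialized by domain.
code_batch = ([0, 0, 0, 0], [[0.9, 0.1]] * 4)
math_batch = ([1, 1, 1, 1], [[0.1, 0.9]] * 4)
batches = [code_batch, math_batch]

micro = micro_batch_lbl(batches, num_experts=2)    # about 1.8
glob = global_batch_lbl(batches, num_experts=2)    # about 1.0
print(f"micro-batch LBL: {micro:.3f}, global-batch LBL: {glob:.3f}")
```

In this toy case the micro-batch loss penalizes the specialized routing (about 1.8), while the global-batch loss sits at 1.0, the value a perfectly uniform router would produce, because the load is balanced once both domains are counted together.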
Taxonomy
Topics: Simulation Techniques and Applications · Scientific Computing and Data Management · AI-based Problem Solving and Planning
Methods: Mixture of Experts
