Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Zihan Qiu; Zeyu Huang; Bo Zheng; Kaiyue Wen; Zekun Wang; Rui Men; Ivan Titov; Dayiheng Liu; Jingren Zhou; Junyang Lin

arXiv:2501.11873·cs.LG·February 5, 2025

Open Access

TL;DR

This paper proposes a global-batch load balancing loss for training Mixture-of-Experts models, which improves expert specialization and overall performance by leveraging diverse data across larger batches.

Contribution

The paper introduces a load-balancing loss computed over the global batch rather than each micro-batch. This requires one extra communication step and improves expert specialization and model performance in large-scale MoE training.

Findings

01. Global-batch LBL improves pre-training perplexity.

02. Global-batch LBL enhances domain specialization of experts.

03. Method scales to models up to 42.8B parameters.

Abstract

This paper revisits the implementation of Load-balancing Loss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoEs is defined as $N_{E} \sum_{i=1}^{N_{E}} f_{i} p_{i}$, where $N_{E}$ is the total number of experts, $f_{i}$ represents the frequency of expert $i$ being selected, and $p_{i}$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy, so $f_{i}$ and the LBL are calculated within a micro-batch and then averaged across parallel groups. A micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL operates almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence…
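To make the difference between the two ways of computing the loss concrete, here is a minimal NumPy sketch of $N_{E} \sum_{i} f_{i} p_{i}$ evaluated per micro-batch and over the global batch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy gate scores, the normalization of $f_{i}$ by the number of routing slots, and the choice to combine a synchronized $f_{i}$ with each group's local $p_{i}$ are all hypothetical. The sum over per-group counts stands in for the communication step, which in a real distributed run would be a collective operation.

```python
import numpy as np

def routing_stats(gate_scores, top_k=1):
    """Return per-expert selection counts and mean gating scores.

    gate_scores: (n_tokens, n_experts) array of router probabilities.
    """
    n_experts = gate_scores.shape[1]
    top = np.argsort(-gate_scores, axis=1)[:, :top_k]   # top-k expert ids per token
    counts = np.bincount(top.ravel(), minlength=n_experts)
    p = gate_scores.mean(axis=0)                        # average gating score p_i
    return counts, p

def lbl(f, p):
    """LBL = N_E * sum_i f_i * p_i."""
    return len(f) * float(np.dot(f, p))

def micro_batch_lbl(batches, top_k=1):
    """Compute the LBL inside each micro-batch, then average the losses."""
    losses = []
    for scores in batches:
        counts, p = routing_stats(scores, top_k)
        f = counts / (scores.shape[0] * top_k)          # assumed normalization
        losses.append(lbl(f, p))
    return float(np.mean(losses))

def global_batch_lbl(batches, top_k=1):
    """Sum selection counts over all micro-batches before forming f_i."""
    stats = [routing_stats(s, top_k) for s in batches]
    total_counts = np.sum([c for c, _ in stats], axis=0)  # stand-in for an all-reduce
    total_slots = sum(s.shape[0] for s in batches) * top_k
    f_global = total_counts / total_slots
    # Assumption: each group pairs the shared f_i with its own local p_i.
    return float(np.mean([lbl(f_global, p) for _, p in stats]))

def peaked_row(expert, n_experts=4, hi=0.7):
    """Toy router output that puts most probability on one expert."""
    row = np.full(n_experts, (1.0 - hi) / (n_experts - 1))
    row[expert] = hi
    return row

# Two "domain-specific" micro-batches: A uses experts 0 and 1, B uses experts 2 and 3.
batch_a = np.stack([peaked_row(0), peaked_row(0), peaked_row(1), peaked_row(1)])
batch_b = np.stack([peaked_row(2), peaked_row(2), peaked_row(3), peaked_row(3)])

micro = micro_batch_lbl([batch_a, batch_b])
glob = global_batch_lbl([batch_a, batch_b])
print(f"micro-batch LBL: {micro:.2f}, global-batch LBL: {glob:.2f}")
# prints "micro-batch LBL: 1.60, global-batch LBL: 1.00"
```

In this toy setup each micro-batch routes to only two of the four experts, so the micro-batch loss penalizes it (1.6), while the same routing is perfectly balanced over the combined batch and reaches the minimum value of 1.0. That is the sense in which a loss computed over more diverse data leaves room for experts to specialize by domain.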

Taxonomy

Topics: Simulation Techniques and Applications · Scientific Computing and Data Management · AI-based Problem Solving and Planning

Methods: Mixture of Experts