Binary-Integer-Programming Based Algorithm for Expert Load Balancing in Mixture-of-Experts Models
Yuan Sun

TL;DR
This paper introduces a binary integer programming-based algorithm for expert load balancing in Mixture-of-Experts models, significantly improving training efficiency and load distribution during pre-training.
Contribution
It proposes a novel BIP-based expert load balancing algorithm that maintains balanced expert loads throughout pre-training, outperforming existing methods.
Findings
Achieves lower perplexities on two MoE language models.
Reduces pre-training time by at least 13%.
Maintains load balance on every expert during all training steps.
Abstract
For pre-training of MoE (Mixture-of-Experts) models, one of the main issues is unbalanced expert loads, which may cause routing collapse or increased computational overhead. Existing methods include the Loss-Controlled method and the Loss-Free method; with both, the degree of imbalance is still high during the first several training steps and decreases only slowly. In this work, we propose BIP-Based Balancing, an expert load balancing algorithm based on binary integer programming (BIP). The algorithm maintains an additional vector q on each MoE layer that helps change the top-K order of s by solving a binary integer program at very small time cost. We implement the algorithm on two MoE language models: 16-expert (0.3B) and 64-expert (1.1B). The experimental results show that on both models, compared with the Loss-Controlled method and the Loss-Free method, our algorithm trains models with…
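To make the role of the vector q more concrete, here is a minimal sketch of how a per-expert offset can change the top-K order of s. It is not the paper's algorithm. It assumes that s is a (tokens x experts) matrix of routing scores, that q is subtracted from s before ranking, and that "balanced" means a per-expert capacity limit. The value of q is hand-picked here, and the helper names (topk_with_offset, expert_loads, brute_force_balanced) are hypothetical. For comparison, a brute-force search solves a tiny capacity-constrained binary assignment, which is one plausible reading of the BIP, for K = 1.

```python
from itertools import product

import numpy as np


def topk_with_offset(s, q, k):
    """Pick k experts per token by ranking s - q instead of s (assumed form)."""
    adjusted = s - q  # q is broadcast across tokens: one offset per expert
    # Sort descending; a stable sort keeps tie-breaking deterministic.
    return np.argsort(-adjusted, axis=1, kind="stable")[:, :k]


def expert_loads(selected, num_experts):
    """Count how many token slots each expert receives."""
    return np.bincount(selected.ravel(), minlength=num_experts)


def brute_force_balanced(s, capacity):
    """Exhaustively solve a tiny K=1 assignment: maximize the total score
    subject to every expert receiving at most `capacity` tokens.
    This constraint set is an assumption, not taken from the paper."""
    n, m = s.shape
    best, best_val = None, -np.inf
    for assign in product(range(m), repeat=n):
        if np.bincount(assign, minlength=m).max() > capacity:
            continue  # violates the per-expert capacity
        val = s[np.arange(n), list(assign)].sum()
        if val > best_val:
            best, best_val = assign, val
    return np.array(best)


# Toy scores: 4 tokens, 2 experts; every token prefers expert 0.
s = np.array([[0.90, 0.60],
              [0.80, 0.70],
              [0.70, 0.40],
              [0.60, 0.55]])

selected_plain = topk_with_offset(s, np.zeros(2), k=1)
loads_plain = expert_loads(selected_plain, 2)      # expert 0 takes everything

q = np.array([0.2, 0.0])                           # hand-picked penalty on expert 0
selected_shifted = topk_with_offset(s, q, k=1)
loads_shifted = expert_loads(selected_shifted, 2)  # two tokens per expert

bip_assignment = brute_force_balanced(s, capacity=2)
```

In this toy case the offset moves the two tokens with the weakest preference for expert 0 over to expert 1, and the resulting routing matches the brute-force assignment. How the paper actually computes q is not specified in the abstract above, so the hand-picked value should be read only as an illustration of the mechanism.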
Taxonomy
Topics: Distributed Sensor Networks and Detection Algorithms
Methods: Mixture of Experts
