Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing
Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, Tong Yang

TL;DR
This paper analyzes expert load fluctuations in Mixture-of-Experts (MoE) models and shows that classical prediction algorithms can forecast expert load accurately once it stabilizes, which could improve resource efficiency during training.
Contribution
It traces and analyzes expert load fluctuations across training iterations and applies classical prediction algorithms to predict expert load accurately in MoE models.
Findings
Average prediction error rates of 1.3% and 1.8% when predicting expert load for the next 1,000 and 2,000 steps, respectively
Identification of transient and stable load states in training
Potential guidance for expert placement and resource allocation
Abstract
MoE facilitates the development of large models because the model's computational cost no longer scales linearly with its parameter count. The learned sparse gating network selects a set of experts for each token to be processed; however, the number of tokens processed by each expert can differ over several successive iterations. These expert load fluctuations reduce computational parallelization and resource utilization. To this end, we traced and analyzed the load of each expert across training iterations for several large language models, and defined a transient state with "obvious load fluctuation" and a stable state with "temporal locality". Moreover, given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that achieve accurate expert load prediction…
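The idea of tracing per-expert load and predicting it from recent history can be sketched as follows. This is a minimal illustration, not the authors' method: the excerpt above does not name the three prediction algorithms, so a plain sliding-window moving average is used as a stand-in. The function names, the definition of load as a token fraction, and the relative-error metric are all assumptions made for this example.

```python
def expert_load(assignments, num_experts):
    """Fraction of tokens routed to each expert in one training step.

    `assignments` is a list of expert indices, one per token
    (a hypothetical top-1 routing trace).
    """
    if not assignments:
        raise ValueError("assignments must contain at least one token")
    counts = [0] * num_experts
    for expert in assignments:
        counts[expert] += 1
    total = len(assignments)
    return [c / total for c in counts]


def moving_average_forecast(history, window):
    """Predict the next load vector as the mean of the last `window` steps.

    Assumption: a sliding-window average stands in for the unnamed
    classical predictors mentioned in the abstract.
    """
    recent = history[-window:]
    num_experts = len(recent[0])
    return [sum(step[i] for step in recent) / len(recent)
            for i in range(num_experts)]


def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over experts with nonzero load.

    Assumption: this is one plausible error metric, not necessarily the
    one behind the error rates reported for the paper.
    """
    errors = [abs(p - a) / a for p, a in zip(predicted, actual) if a > 0]
    return sum(errors) / len(errors)


# Toy trace: 3 experts, 4 tokens per step.
history = [
    expert_load([0, 0, 1, 2], 3),
    expert_load([0, 1, 1, 2], 3),
]
forecast = moving_average_forecast(history, window=2)
actual = expert_load([0, 1, 2, 2], 3)
print(forecast, mean_relative_error(forecast, actual))
```

In a stable state with temporal locality, a forecast like this should stay close to the observed load, which is what would make it usable for expert placement; in the transient state the window average lags behind the fluctuations.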
Taxonomy
Topics: Forecasting Techniques and Applications · Big Data and Business Intelligence
Methods: Sparse Evolutionary Training · Mixture of Experts
