Prediction Is All MoE Needs: Expert Load Distribution Goes from Fluctuating to Stabilizing

Peizhuang Cong, Aomufei Yuan, Shimao Chen, Yuxuan Tian, Bowen Ye, Tong Yang

arXiv:2404.16914 · cs.LG · April 29, 2024 · 1 citation

TL;DR

This paper traces expert load fluctuations during the training of Mixture of Experts (MoE) models and shows that classical prediction algorithms can forecast expert load accurately, which can guide expert placement and resource allocation during training.

Contribution

It presents a detailed analysis of expert load fluctuations across training iterations and deploys three classical prediction algorithms to achieve accurate expert load prediction in MoE models.

Findings

01. Average prediction error rates of 1.3% and 1.8% for the next 1,000 and 2,000 steps, respectively

02. Identification of a transient state and a stable state of expert load during training

03. Potential guidance for expert placement and resource allocation
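
To make the first finding concrete, the following is a minimal sketch of how a load forecast and its error rate could be computed. It is not the paper's method: the summary does not name the three prediction algorithms, so a sliding-window mean is used here as an assumed stand-in, and the error metric (mean relative error across experts) is likewise an assumption. The function names and the sample load values are hypothetical.

```python
def sliding_window_forecast(history, window):
    """Predict each expert's load proportion as the mean of the last `window` steps.

    `history` is a list of steps; each step is a list of per-expert load proportions.
    This is an assumed stand-in predictor, not one confirmed by the paper.
    """
    recent = history[-window:]
    num_experts = len(recent[0])
    return [sum(step[e] for step in recent) / len(recent) for e in range(num_experts)]


def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over experts with non-zero load."""
    errors = [abs(p - a) / a for p, a in zip(predicted, actual) if a > 0]
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical load proportions for 4 experts over 3 traced steps.
history = [
    [0.30, 0.20, 0.25, 0.25],
    [0.28, 0.22, 0.26, 0.24],
    [0.29, 0.21, 0.25, 0.25],
]
# Hypothetical load observed at a later step.
future = [0.29, 0.21, 0.26, 0.24]

pred = sliding_window_forecast(history, window=3)
err = mean_relative_error(pred, future)
print(f"predicted proportions: {[round(p, 4) for p in pred]}")
print(f"mean relative error: {err:.4f}")
```

A window-based average only makes sense when the load shows temporal locality, which is the property the summary attributes to the stable state; in a state with obvious fluctuation a different predictor would likely be needed.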

Abstract

MoE facilitates the development of large models by making the computational complexity of the model no longer scale linearly with increasing parameters. The learned sparse gating network selects a set of experts for each token to be processed; however, this may lead to differences in the number of tokens processed by each expert over several successive iterations, i.e., expert load fluctuations, which reduce computational parallelization and resource utilization. To this end, we traced and analyzed the load of each expert across training iterations for several large language models in this work, and defined the transient state with "obvious load fluctuation" and the stable state with "temporal locality". Moreover, given the characteristics of these two states and the computational overhead, we deployed three classical prediction algorithms that achieve accurate expert load prediction…
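
The notion of expert load in the abstract can be illustrated with a small sketch. It assumes top-k routing over raw gate scores and measures load as each expert's share of token-to-expert assignments in one iteration; the random scores stand in for a real gating network, and all names and sizes here are hypothetical rather than taken from the paper.

```python
import random


def top_k_experts(scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    return sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:k]


def expert_load_proportions(batch_scores, k):
    """Share of token-to-expert assignments that each expert receives in one batch."""
    num_experts = len(batch_scores[0])
    counts = [0] * num_experts
    for scores in batch_scores:
        for e in top_k_experts(scores, k):
            counts[e] += 1
    total = len(batch_scores) * k
    return [c / total for c in counts]


def simulate_step(num_tokens, num_experts, k, rng):
    """Stand-in for one training iteration: random gate scores, then load counting."""
    batch = [[rng.random() for _ in range(num_experts)] for _ in range(num_tokens)]
    return expert_load_proportions(batch, k)


rng = random.Random(0)
# Trace the load of 8 experts over 3 iterations with top-2 routing.
loads = [simulate_step(256, 8, 2, rng) for _ in range(3)]
for step, load in enumerate(loads):
    print(step, [round(p, 3) for p in load])
```

Tracing these per-iteration proportions over many steps yields the kind of time series in which fluctuation or temporal locality could be observed.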

Taxonomy

Topics: Forecasting Techniques and Applications · Big Data and Business Intelligence

Methods: Sparse Evolutionary Training · Mixture of Experts