MoE-GPS: Guidelines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing
Haiyue Ma, Zhixu Du, Yiran Chen

TL;DR
This paper introduces MoE-GPS, a framework for optimizing expert prediction strategies in multi-GPU MoE networks, significantly improving load balancing and inference performance by selecting the best predictor based on system conditions.
Contribution
Proposes MoE-GPS, a system-guided framework that chooses optimal expert prediction strategies, notably advocating for Distribution-Only Prediction to enhance load balancing and reduce overhead.
Findings
Distribution-Only Prediction, which predicts only the overall token distribution, significantly reduces prediction overhead.
MoE-GPS improves inference performance by over 23%.
Optimal predictor selection depends on system configuration.
Abstract
In a multi-GPU Mixture-of-Experts (MoE) network, experts are distributed across different GPUs, which creates load imbalance because each expert processes a different number of tokens. Recent work improves MoE inference load balance by dynamically duplicating popular experts onto more GPUs to process the excess tokens, which requires predicting the token distribution before routing. In this paper, we discuss the tradeoffs among prediction strategies, accuracy, overhead, and end-to-end system performance. We propose MoE-GPS, a framework that guides the selection of the optimal predictor design under various system configurations by quantifying the performance impact on system-level model runtime. Specifically, we advocate for Distribution-Only Prediction, a prediction strategy that predicts only the overall token distribution, which significantly reduces overhead compared to the traditional Token-to-Expert…
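To make the idea of duplicating popular experts from a predicted token distribution concrete, here is a minimal sketch. It is not the paper's algorithm: the function names `plan_replicas` and `max_replica_load`, the greedy policy, and the assumption that a duplicated expert's tokens split evenly across its replicas are all illustrative choices made here.

```python
def plan_replicas(predicted_tokens, extra_replicas):
    """Greedily assign extra replicas to the most loaded experts.

    predicted_tokens: predicted token count per expert (hypothetical input,
        standing in for the output of a distribution-only predictor).
    extra_replicas: number of spare expert slots available for duplicates.
    Assumes tokens of a duplicated expert are split evenly across replicas.
    """
    replicas = [1] * len(predicted_tokens)
    for _ in range(extra_replicas):
        # Per-replica load under the current plan.
        loads = [t / r for t, r in zip(predicted_tokens, replicas)]
        hottest = loads.index(max(loads))
        replicas[hottest] += 1
    return replicas


def max_replica_load(predicted_tokens, replicas):
    """Largest number of tokens any single replica must process."""
    return max(t / r for t, r in zip(predicted_tokens, replicas))


if __name__ == "__main__":
    tokens = [80, 10, 5, 5]  # made-up skewed distribution over 4 experts
    plan = plan_replicas(tokens, extra_replicas=2)
    print("replicas per expert:", plan)
    print("max load before:", max_replica_load(tokens, [1] * len(tokens)))
    print("max load after:", max_replica_load(tokens, plan))
```

Note that the plan needs only the per-expert totals, not the destination of each individual token, which is the kind of information a distribution-only predictor is described as providing.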
Taxonomy
Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques
Methods: Mixture of Experts
