MoE-GPS: Guidelines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing

Haiyue Ma; Zhixu Du; Yiran Chen

arXiv:2506.07366 · cs.LG · June 10, 2025

TL;DR

This paper introduces MoE-GPS, a framework for optimizing expert prediction strategies in multi-GPU MoE networks, significantly improving load balancing and inference performance by selecting the best predictor based on system conditions.

Contribution

Proposes MoE-GPS, a system-guided framework that chooses optimal expert prediction strategies, notably advocating for Distribution-Only Prediction to enhance load balancing and reduce overhead.

Findings

01. Distribution-Only Prediction has significantly lower overhead than the traditional Token-to-Expert Prediction.

02. MoE-GPS improves inference performance by over 23%.

03. The optimal predictor depends on the system configuration.

Abstract

In multi-GPU Mixture-of-Experts (MoE) networks, experts are distributed across different GPUs, which creates load imbalance because each expert processes a different number of tokens. Recent works improve MoE inference load balance by dynamically duplicating popular experts to more GPUs to process the excess tokens, which requires predicting the token distribution before routing. In this paper, we discuss the tradeoffs among prediction strategies, accuracies, overhead, and end-to-end system performance. We propose MoE-GPS, a framework that guides the selection of the optimal predictor design under various system configurations by quantifying the performance impact on system-level model runtime. Specifically, we advocate for Distribution-Only Prediction, a prediction strategy that predicts only the overall token distribution, which significantly reduces overhead compared to the traditional Token-to-Expert…
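
To make the load-imbalance argument concrete, here is a minimal Python sketch of how a predicted per-expert token distribution could drive expert duplication. It is an illustration under stated assumptions, not the paper's algorithm: the greedy placement policy, the even split of an expert's tokens across its replicas, the max-over-mean imbalance metric, and all names (`gpu_loads`, `imbalance`, `duplicate_hot_experts`, `extra_slots`) are hypothetical. The token counts are made-up illustrative numbers, not results from the paper.

```python
def gpu_loads(token_counts, placement, num_gpus):
    """Tokens handled per GPU, assuming (hypothetically) that an expert's
    tokens are split evenly across all GPUs holding a replica of it."""
    loads = [0.0] * num_gpus
    for expert, count in token_counts.items():
        replicas = placement[expert]
        for gpu in replicas:
            loads[gpu] += count / len(replicas)
    return loads


def imbalance(loads):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    return max(loads) / (sum(loads) / len(loads))


def duplicate_hot_experts(predicted_counts, placement, num_gpus, extra_slots):
    """Greedy illustrative policy: repeatedly copy the expert with the highest
    predicted per-replica load onto the least-loaded GPU that lacks it."""
    placement = {e: list(gpus) for e, gpus in placement.items()}  # do not mutate input
    for _ in range(extra_slots):
        loads = gpu_loads(predicted_counts, placement, num_gpus)
        hot = max(predicted_counts,
                  key=lambda e: predicted_counts[e] / len(placement[e]))
        candidates = [g for g in range(num_gpus) if g not in placement[hot]]
        if not candidates:
            break  # the hottest expert is already replicated on every GPU
        target = min(candidates, key=lambda g: loads[g])
        placement[hot].append(target)
    return placement


# Illustrative setup: 4 experts, one per GPU, with expert 0 predicted to be popular.
predicted = {0: 700, 1: 100, 2: 100, 3: 100}
initial = {0: [0], 1: [1], 2: [2], 3: [3]}

before = imbalance(gpu_loads(predicted, initial, num_gpus=4))
balanced = duplicate_hot_experts(predicted, initial, num_gpus=4, extra_slots=2)
after = imbalance(gpu_loads(predicted, balanced, num_gpus=4))

print(f"imbalance before duplication: {before:.2f}")
print(f"imbalance after duplication:  {after:.2f}")
print(f"placement after duplication:  {balanced}")
```

Note that the sketch only consumes per-expert token counts, never per-token assignments. That is the kind of input a predictor of the overall token distribution would supply, whereas a token-to-expert predictor would have to produce an assignment for every token first.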

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Stochastic Gradient Optimization Techniques

Methods: Mixture of Experts