Latent Prototype Routing: Achieving Near-Perfect Load Balancing in Mixture-of-Experts
Jiajie Yang

TL;DR
This paper introduces Latent Prototype Routing (LPR), a new expert routing method for Mixture-of-Experts models that significantly improves load balancing and resource utilization without sacrificing model performance.
Contribution
LPR offers a generalized routing framework based on clustering that enhances load balancing in MoE models, addressing a key limitation of existing approaches.
Findings
Reduces the Gini coefficient of expert load from 0.70 to 0.035 on average
Improves the min-max expert load ratio from 1e-6 to 0.70
Achieves near-perfect load balancing across multiple open-source MoE models, including DeepSeek-V3, Qwen3-MoE, and Mixtral
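The two balance metrics reported above can be computed from per-expert token counts. The following is a minimal sketch, assuming the standard discrete Gini coefficient and assuming the min-max ratio is the smallest expert load divided by the largest; the paper's exact definitions are not given in this summary, and the helper names (`expert_loads`, `gini`, `min_max_ratio`) are hypothetical.

```python
from collections import Counter


def expert_loads(assignments, num_experts):
    """Count how many tokens were routed to each expert id (0..num_experts-1)."""
    counts = Counter(assignments)
    return [counts.get(e, 0) for e in range(num_experts)]


def gini(loads):
    """Standard Gini coefficient: 0 means perfectly even load,
    (n - 1) / n means all load sits on a single expert.
    Assumption: this is the textbook formula, not necessarily the paper's."""
    x = sorted(float(v) for v in loads)
    n = len(x)
    total = sum(x)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over loads sorted in ascending order.
    weighted = sum((2 * i - n - 1) * v for i, v in enumerate(x, start=1))
    return weighted / (n * total)


def min_max_ratio(loads):
    """Least-loaded expert's count divided by the most-loaded expert's count.
    Assumption: this matches what the summary calls the min-max load ratio."""
    mx = max(loads)
    return min(loads) / mx if mx > 0 else 0.0


if __name__ == "__main__":
    # Toy routing trace: each entry is the expert id chosen for one token.
    loads = expert_loads([0, 1, 1, 2, 2, 2, 3, 3], num_experts=4)
    print(loads, gini(loads), min_max_ratio(loads))
```

Under these definitions, a Gini value near 0 together with a min-max ratio near 1 indicates evenly used experts, while a ratio near 0 means at least one expert receives almost no tokens.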
Abstract
Mixture-of-Experts (MoE) architectures have emerged as a key strategy for scaling large language models (LLMs) efficiently. However, current MoE systems suffer from severe load imbalance, where only a small subset of experts is consistently activated during training and inference, leading to significant underutilization of model capacity and computational resources. In this work, we revisit expert routing through a clustering perspective and propose Latent Prototype Routing (LPR), a novel routing framework that generalizes existing approaches while promoting balanced expert utilization without compromising downstream performance. Extensive experiments across multiple open-source MoE models -- including DeepSeek-V3, Qwen3-MoE, and Mixtral -- demonstrate that LPR reduces the Gini coefficient of expert load from 0.70 to 0.035 on average, improves the min-max expert load ratio from 1e-6 to…
Taxonomy
Topics: Machine Learning and Data Classification · Topic Modeling · Recommender Systems and Techniques
Methods: Mixture of Experts
