Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens
Yanpeng Yu, Haiyue Ma, Krish Agarwal, Nicolai Oswald, Qijing Huang, Hugo Linsenmaier, Chunhui Mei, Ritchie Zhao, Ritika Borkar, Bita Darvish Rouhani, David Nellans, Ronny Krashinsky, Anurag Khandelwal

TL;DR
This paper introduces METRO, a novel token-routing algorithm for MoE models that balances activated experts across GPUs in memory-bound regimes, significantly improving latency and throughput during model serving.
Contribution
The paper proposes METRO, an expert routing method that balances the number of activated experts per GPU rather than token counts. In the memory-bound regime it outperforms existing token-balancing approaches while adding minimal computational overhead.
Findings
METRO reduces decode latency by 11-22%.
METRO improves total token throughput by 3-21%.
METRO achieves up to 4.11x throughput gain at fixed latency.
Abstract
Expert Parallelism (EP) permits Mixture of Experts (MoE) models to scale beyond a single GPU. To address load imbalance across GPUs in EP, existing approaches aim to balance the number of tokens each GPU processes. Surprisingly, we find that this objective degrades performance rather than improving it when processing is memory-bound, a common occurrence in MoE serving, especially in the decode phase. Our analysis reveals that balancing the number of tokens processed per GPU increases the number of activated experts, exacerbating memory pressure in the memory-bound regime. We propose Minimum Expert Token ROuting (METRO), a novel token-routing algorithm for high-performance expert-parallel MoE serving in the memory-bound regime that balances the number of activated experts per GPU rather than token counts. METRO achieves near-optimal routing quality with minimal computational overhead by…
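The difference between the two balancing objectives in the abstract can be made concrete with a toy example. The sketch below is not METRO itself; it only computes the two quantities being compared (tokens per GPU and activated experts per GPU) for two hand-written routings. The function name `per_gpu_load`, the two-GPU layout, and the assumption that expert 0 has a replica on both GPUs (so its tokens can be sent to either one) are all hypothetical and chosen for illustration.

```python
def per_gpu_load(assignment, num_gpus):
    """Return (tokens per GPU, activated experts per GPU).

    assignment: list of (expert_id, gpu_id) pairs, one per routed token.
    An expert counts as activated on a GPU if at least one token is
    routed to it there.
    """
    tokens = [0] * num_gpus
    active = [set() for _ in range(num_gpus)]
    for expert_id, gpu_id in assignment:
        tokens[gpu_id] += 1
        active[gpu_id].add(expert_id)
    return tokens, [len(s) for s in active]


# Hypothetical workload: 4 tokens select expert 0, 2 tokens select expert 1.
# Assumed placement: expert 0 is available on GPU 0 and GPU 1, expert 1 on GPU 1 only.

# Routing A equalizes token counts by spilling one expert-0 token onto GPU 1.
token_balanced = [(0, 0)] * 3 + [(0, 1)] + [(1, 1)] * 2

# Routing B keeps every expert-0 token on GPU 0, so each GPU activates one expert.
expert_balanced = [(0, 0)] * 4 + [(1, 1)] * 2

tb_tokens, tb_experts = per_gpu_load(token_balanced, 2)
eb_tokens, eb_experts = per_gpu_load(expert_balanced, 2)

print("token-balanced :", tb_tokens, tb_experts)
print("expert-balanced:", eb_tokens, eb_experts)
```

In this toy case, routing A equalizes tokens at 3 per GPU but forces GPU 1 to activate two experts, while routing B leaves tokens uneven at 4 and 2 but activates only one expert on each GPU. If expert weights dominate memory traffic, as the abstract argues for the memory-bound regime, the second routing is the cheaper one for the most heavily loaded GPU.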
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Mobile Crowdsensing and Crowdsourcing · Graph Theory and Algorithms
