Efficient MoE Serving in the Memory-Bound Regime: Balance Activated Experts, Not Tokens

Yanpeng Yu; Haiyue Ma; Krish Agarwal; Nicolai Oswald; Qijing Huang; Hugo Linsenmaier; Chunhui Mei; Ritchie Zhao; Ritika Borkar; Bita Darvish Rouhani; David Nellans; Ronny Krashinsky; Anurag Khandelwal

arXiv:2512.09277·cs.DC·December 11, 2025

Open Access

TL;DR

This paper introduces METRO, a token-routing algorithm for expert-parallel MoE serving that balances the number of activated experts per GPU, rather than token counts, when processing is memory-bound. The authors report lower decode latency and higher token throughput as a result.

Contribution

The paper proposes METRO, an expert routing method that balances activated experts per GPU instead of token counts. According to the authors, it outperforms existing token-balancing approaches in the memory-bound regime while adding minimal computational overhead.

Findings

01. METRO reduces decode latency by 11-22%.

02. METRO improves total token throughput by 3-21%.

03. METRO achieves up to 4.11x throughput gain at fixed latency.

Abstract

Expert Parallelism (EP) permits Mixture of Experts (MoE) models to scale beyond a single GPU. To address load imbalance across GPUs in EP, existing approaches aim to balance the number of tokens each GPU processes. Surprisingly, we find that this objective degrades performance rather than improving it when processing is memory-bound, a common occurrence in MoE serving, especially in the decode phase. Our analysis reveals that balancing the number of tokens processed per GPU increases the number of activated experts, exacerbating memory pressure in the memory-bound regime. We propose Minimum Expert Token ROuting (METRO), a novel token-routing algorithm for high-performance expert-parallel MoE serving in the memory-bound regime that balances the number of activated experts per GPU rather than token counts. METRO achieves near-optimal routing quality with minimal computational overhead by…
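
The trade-off described in the abstract can be illustrated with a toy example. The sketch below is not the METRO algorithm, which the truncated abstract does not specify. It assumes a hypothetical setup in which some experts have replicas on more than one GPU, and it compares two simple greedy routers written for illustration only: one that evens out token counts and one that evens out the number of activated experts. All function names and the example data are assumptions.

```python
def activated_and_tokens(assignment, num_gpus):
    """Return (activated experts per GPU, tokens per GPU) for a routing.

    assignment is a list of (expert, gpu) pairs, one per token-expert request.
    """
    active = [set() for _ in range(num_gpus)]
    tokens = [0] * num_gpus
    for expert, gpu in assignment:
        active[gpu].add(expert)
        tokens[gpu] += 1
    return [len(s) for s in active], tokens


def route_balance_tokens(requests, replicas, num_gpus):
    """Illustrative token balancing: send each request to the replica GPU
    that currently holds the fewest tokens (ties go to the lower GPU id)."""
    load = [0] * num_gpus
    out = []
    for expert in requests:
        gpu = min(replicas[expert], key=lambda g: (load[g], g))
        load[gpu] += 1
        out.append((expert, gpu))
    return out


def route_balance_experts(requests, replicas):
    """Illustrative expert balancing: pin each distinct expert to a single
    replica GPU, picking the GPU with the fewest activated experts so far.
    Experts with fewer placement options are placed first."""
    active_count = {}
    home = {}
    for expert in sorted(set(requests), key=lambda e: (len(replicas[e]), e)):
        gpu = min(replicas[expert], key=lambda g: (active_count.get(g, 0), g))
        home[expert] = gpu
        active_count[gpu] = active_count.get(gpu, 0) + 1
    return [(e, home[e]) for e in requests]


if __name__ == "__main__":
    # Hypothetical decode step: 8 token-expert requests over 4 experts, 2 GPUs.
    requests = [0, 0, 0, 0, 1, 1, 2, 3]
    # Experts 0 and 1 are replicated on both GPUs; 2 and 3 live on one GPU each.
    replicas = {0: [0, 1], 1: [0, 1], 2: [0], 3: [1]}

    by_tokens = route_balance_tokens(requests, replicas, 2)
    by_experts = route_balance_experts(requests, replicas)

    print(activated_and_tokens(by_tokens, 2))   # ([3, 3], [4, 4])
    print(activated_and_tokens(by_experts, 2))  # ([2, 2], [5, 3])
```

In this toy case, token balancing gives each GPU exactly 4 tokens but forces each GPU to read the weights of 3 experts, because requests for the replicated experts are split across both GPUs. Expert balancing leaves the token counts uneven (5 versus 3) but activates only 2 experts per GPU. When a step is dominated by loading expert weights rather than by arithmetic, the second outcome is the one the abstract argues for.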

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Mobile Crowdsensing and Crowdsourcing · Graph Theory and Algorithms