Opportunistic Expert Activation: Batch-Aware Expert Routing for Faster Decode Without Retraining

Costin-Andrei Oncescu; Qingyang Wu; Wai Tong Chung; Robert Wu; Bryan Gopal; Junxiong Wang; Tri Dao; Ben Athiwaratkun

arXiv:2511.02237 · cs.LG · November 5, 2025

TL;DR

This paper proposes a batch-aware expert routing method for Mixture-of-Experts models. It reduces decode latency by dynamically re-routing tokens to experts that are already loaded for other tokens in the same batch, while maintaining comparable accuracy.

Contribution

It introduces a dynamic re-routing framework that uses batch information to lower the number of activated experts, and thus the decode latency, without retraining and while preserving comparable quality.

Findings

01. Achieves 39% and 15% latency reductions on the Qwen3 models evaluated (Qwen3-30B and Qwen3-235B)

02. Maintains comparable accuracy at the reduced latency

03. Effective for large-scale MoE models during autoregressive decoding

Abstract

An increasing number of LLMs employ Mixture-of-Experts (MoE) architectures, in which the feed-forward layer is replaced by a pool of experts and each token activates only a small subset of them. During autoregressive generation, these models often enter a memory-bound regime even at moderate batch sizes, because the average expert load grows more slowly than in an equivalent dense feed-forward layer. Consequently, MoE latency is governed by the number of activated experts. We introduce a framework for dynamically re-routing the token-to-expert mapping to lower this number (and thus the decode latency) while preserving comparable quality. Our best results use batch-aware routing, in which tokens piggyback on experts that have already been loaded into memory because they are crucial to other tokens in the same batch. Empirically, we evaluate our method on the Qwen3-30B and Qwen3-235B…
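To make the piggybacking idea in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' algorithm: the two-phase scheme, the function names, and the `k_core` parameter are all hypothetical. Each token first keeps its `k_core` highest-scoring experts, the union of those across the batch is treated as the loaded set, and each token then fills its remaining slots only from experts in that set.

```python
import numpy as np


def standard_topk(scores, k):
    """Vanilla top-k routing: each token takes its k highest-scoring experts."""
    return np.argsort(-scores, axis=1, kind="stable")[:, :k]


def piggyback_route(scores, k, k_core):
    """Hypothetical batch-aware routing sketch (an assumption, not the paper's code).

    scores: (n_tokens, n_experts) router scores for one decode step.
    k:      experts per token under standard routing.
    k_core: experts per token that are always kept (k_core <= k).
    """
    n_tokens, n_experts = scores.shape
    # Phase 1: every token keeps its k_core best experts; their union over
    # the batch is the set of experts that must be loaded anyway.
    core = standard_topk(scores, k_core)
    loaded = np.zeros(n_experts, dtype=bool)
    loaded[np.unique(core)] = True
    # Phase 2: each token picks its top-k experts among the loaded ones only,
    # so no additional expert has to be fetched from memory.
    masked = np.where(loaded[None, :], scores, -np.inf)
    routes = np.argsort(-masked, axis=1, kind="stable")[:, :k]
    # If fewer than k experts are loaded, the extra slots are marked invalid.
    valid = np.isfinite(np.take_along_axis(masked, routes, axis=1))
    return routes, valid, loaded


def num_activated(routes, valid=None):
    """Number of distinct experts touched by the whole batch."""
    chosen = routes if valid is None else routes[valid]
    return np.unique(chosen).size


# Small demo with random router scores (illustrative values only).
rng = np.random.default_rng(0)
demo_scores = rng.standard_normal((16, 128))
demo_routes, demo_valid, _ = piggyback_route(demo_scores, k=8, k_core=2)
print("standard routing:", num_activated(standard_topk(demo_scores, 8)))
print("piggyback routing:", num_activated(demo_routes, demo_valid))
```

In this sketch the number of activated experts can never exceed `n_tokens * k_core`, and it can never exceed the count under standard top-k routing, because every core expert is also one of its token's top-k experts. How the routing weights should be renormalised after re-routing is left out here.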

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Generative Adversarial Networks and Image Synthesis · IoT and Edge/Fog Computing