Route Experts by Sequence, not by Token

Tiansheng Wen; Yifei Wang; Aosong Feng; Long Ma; Xinyang Liu; Yifan Wang; Lixuan Guo; Bo Chen; Stefanie Jegelka; Chenyu You

arXiv:2511.06494·cs.LG·March 30, 2026

TL;DR

SeqTopK is a simple, efficient routing method for Mixture-of-Experts (MoE) models that allocates the expert budget per sequence rather than per token, improving performance especially in high-sparsity regimes.

Contribution

The paper introduces a minimal sequence-level expert allocation method that is learned end to end, needs only a few lines of code, and remains compatible with pretrained MoE models rather than requiring retraining from scratch.

Findings

01

Consistent improvements over TopK and prior adaptive methods.

02

Gains of up to 16.9% in high-sparsity regimes.

03

Adds less than 1% overhead and is compatible with pretrained MoE models.

Abstract

Mixture-of-Experts (MoE) architectures scale large language models (LLMs) by activating only a subset of experts per token, but the standard TopK routing assigns the same fixed number of experts to all tokens, ignoring their varying complexity. Prior adaptive routing methods introduce additional modules and hyperparameters, often requiring costly retraining from scratch. We propose Sequence-level TopK (SeqTopK), a minimal modification that shifts the expert budget from the token level to the sequence level. By selecting the top $T \cdot K$ experts across all $T$ tokens, SeqTopK enables end-to-end learned dynamic allocation -- assigning more experts to difficult tokens and fewer to easy ones -- while preserving the same overall budget. SeqTopK requires only a few lines of code, adds less than 1% overhead, and remains fully compatible with pretrained MoE models. Experiments across math,…
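The budget shift described in the abstract can be illustrated with a small sketch. It contrasts per-token TopK with a sequence-level selection that keeps the $T \cdot K$ highest router scores across all $T$ tokens. This is an illustration under assumptions, not the authors' implementation: the function names `token_topk` and `seq_topk`, the toy score matrix, and the choice to compare raw scores directly are all hypothetical, and details such as score normalization, tie-breaking, and behavior during autoregressive decoding are not specified on this page.

```python
from typing import List


def token_topk(scores: List[List[float]], k: int) -> List[List[int]]:
    """Standard TopK: every token keeps its own k highest-scoring experts."""
    selected = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda e: row[e], reverse=True)
        selected.append(sorted(order[:k]))
    return selected


def seq_topk(scores: List[List[float]], k: int) -> List[List[int]]:
    """Sequence-level sketch: keep the T*k highest scores over the whole sequence.

    Assumption: raw router scores are compared directly across tokens.
    """
    num_tokens = len(scores)
    budget = num_tokens * k  # same total budget as per-token TopK
    # Flatten to (score, token index, expert index) triples.
    flat = [(row[e], t, e) for t, row in enumerate(scores) for e in range(len(row))]
    flat.sort(key=lambda item: item[0], reverse=True)
    selected: List[List[int]] = [[] for _ in range(num_tokens)]
    for _, t, e in flat[:budget]:
        selected[t].append(e)
    return [sorted(experts) for experts in selected]


# Hypothetical router scores for T = 3 tokens and 4 experts.
scores = [
    [0.90, 0.80, 0.70, 0.10],  # token with several strong expert matches
    [0.60, 0.05, 0.04, 0.03],  # token dominated by a single expert
    [0.50, 0.40, 0.02, 0.01],
]

per_token = token_topk(scores, k=2)
per_sequence = seq_topk(scores, k=2)
print(per_token)     # [[0, 1], [0, 1], [0, 1]]
print(per_sequence)  # [[0, 1, 2], [0], [0, 1]]
```

In this toy example both schemes activate six (token, expert) pairs in total, but the sequence-level selection gives the first token three experts and the second token one, which is the kind of uneven allocation the abstract describes.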

Code & Models

Repositories

Y-Research-SBU/SeqTopK (GitHub)
