SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models
Juntong Wu, Jialiang Cheng, Fuyu Lv, Ou Dan, Li Yuan

TL;DR
SERE is a dynamic expert re-routing method that improves batch decoding efficiency in MoE models by reducing active experts based on similarity, achieving up to 2x speedup with minimal quality loss.
Contribution
SERE introduces a novel similarity-based expert re-routing technique for efficient batch decoding in MoE models, avoiding static pruning and enabling dynamic expert skipping.
Findings
Up to 2.0x speedup in decoding time.
Minimal loss in model quality.
Effective in complex reasoning benchmarks.
Abstract
Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping…
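The re-routing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how primary experts are chosen, which similarity measure is used, or how critical experts are identified, so the sketch makes three assumptions: primary experts are the most frequently selected experts in the batch, the similarity matrix `sim` is precomputed offline, and a secondary expert counts as critical (and stays active) when its best similarity to any primary falls below `keep_threshold`. The names `reroute_batch`, `num_primary`, and `keep_threshold` are hypothetical.

```python
import numpy as np

def reroute_batch(topk_ids, sim, num_primary, keep_threshold):
    """Illustrative re-routing of router choices for one decoding batch.

    topk_ids: (num_tokens, k) integer array of router-selected expert ids.
    sim: (num_experts, num_experts) similarity matrix (assumed precomputed).
    Returns an array of the same shape with secondary experts re-mapped.
    """
    num_experts = sim.shape[0]
    counts = np.bincount(topk_ids.ravel(), minlength=num_experts)
    active = np.flatnonzero(counts)
    # Assumption: rank active experts by how many token slots selected them.
    order = active[np.argsort(-counts[active], kind="stable")]
    primary = order[:num_primary]

    mapping = np.arange(num_experts)  # identity mapping by default
    for e in order[num_primary:]:     # the remaining experts are "secondary"
        best = primary[np.argmax(sim[e, primary])]
        if sim[e, best] >= keep_threshold:
            mapping[e] = best         # send e's tokens to the similar primary
        # Otherwise e is treated as critical and stays active (assumption).
    # A token may now list the same expert twice; merging the corresponding
    # routing weights is left out of this sketch.
    return mapping[topk_ids]

# Toy batch: 4 tokens, top-2 routing, 4 experts.
topk_ids = np.array([[0, 1], [0, 2], [1, 3], [0, 1]])
sim = np.array([
    [1.0, 0.4, 0.9, 0.1],
    [0.4, 1.0, 0.2, 0.3],
    [0.9, 0.2, 1.0, 0.5],
    [0.1, 0.3, 0.5, 1.0],
])
rerouted = reroute_batch(topk_ids, sim, num_primary=2, keep_threshold=0.5)
print(rerouted.tolist())  # [[0, 1], [0, 0], [1, 3], [0, 1]]
```

In this toy batch, expert 2 is folded into expert 0 because their similarity (0.9) clears the threshold, while expert 3 stays active because its best match among the primaries (0.3) does not. The batch therefore touches three experts instead of four.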
Peer Reviews
Decision: ICLR 2026 Poster
- This work tackles a critical and interesting challenge in MoE serving by employing a simple and intuitive mitigation method.
- The authors' implementation, which features custom CUDA kernels, is good for its easy adaptation to vLLM.
- Backed by several insightful observational experiments, the method achieves surprisingly good results.
- The method builds on the observation that experts exhibit high similarity scores. Does this phenomenon occur across all MoE models, or is it specifically a byproduct of using upcycling [1] as the initialization strategy?
Minor problem:
- It would be helpful to provide more details about the specific problem this work aims to address. At first glance, readers might assume the paper focuses on improving load balancing. Including key terms such as “expert eviction” could better clarify the inten
- The authors address a relevant problem that limits the efficiency and scalability of MoEs.
- The idea of the method is simple yet meaningful. The results indicate speedups up to 2x with acceptable accuracy drop.
- The authors provide an efficient kernel pluggable into vLLM.
**Reliance on and Sensitivity to Calibration Data:** The method's foundation is a similarity matrix computed on a calibration dataset. This introduces an offline computation step and a data dependency. While the ablations in Tables 4 and 5 show robustness to the choice of a general-domain dataset, the results also hint at a potential domain mismatch problem. For instance, the average similarity scores for the 'Code' domain are consistently lower than for others, suggesting that re-routing decisi
The analysis in this paper is well-reasoned. It begins with a similarity analysis that reveals important insights into the relationships among MoE experts, which then guide subsequent research and design decisions. I like this style. The kernel-level optimization and re-routing mechanism are well-aligned with system-level efficiency goals. This is an impressive aspect of the study. Extensive experiments on multiple MoE architectures (Qwen, DeepSeek) and benchmarks (OpenCompass, MATH, HumanEv
The work is mostly empirical and engineering-oriented; it lacks theoretical analysis of why similarity-based re-routing preserves model behavior or formal guarantees on capacity preservation. No analytical insight is given into how similarity thresholds interact with model generalization or stability.
Taxonomy
Topics: Advanced Neural Network Applications · Mobile Crowdsensing and Crowdsourcing · Stochastic Gradient Optimization Techniques
