Making Every Verified Token Count: Adaptive Verification for MoE Speculative Decoding
Lehan Pan, Ziyang Tao, Ruoyu Pang, Xiao Wang, Jianjun Zhao, Yanyong Zhang

TL;DR
EVICT is an adaptive verification method for MoE speculative decoding that truncates the draft tree before target verification, lowering verification cost and speeding up decoding without additional training or hyperparameter tuning.
Contribution
The paper introduces a training-free, hyperparameter-free, lossless adaptive verification technique for MoE models. It verifies only the cost-effective prefix of the draft tree, chosen from drafter signals and offline-profiled verification cost.
Findings
Achieves up to 2.35x speedup over autoregressive decoding.
Provides an average 1.21x speedup over EAGLE-3.
Reduces unnecessary expert activations during verification.
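To give a rough sense of why verifying more draft tokens activates more experts, the sketch below computes the expected number of distinct experts touched in one MoE layer. It assumes each token independently routes to `top_k` of `num_experts` experts uniformly at random. That is a simplification for illustration only, not the routing behaviour measured in the paper, and the function name and parameters are hypothetical.

```python
def expected_active_experts(num_tokens: int, num_experts: int, top_k: int) -> float:
    """Expected size of the union of activated experts in one MoE layer.

    Assumption (illustrative, not from the paper): every token picks its
    top_k experts uniformly at random and independently of other tokens.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in (0, num_experts]")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    # Probability that one given expert is skipped by a single token.
    miss_one_token = 1.0 - top_k / num_experts
    # Probability that the expert is skipped by all tokens in the batch.
    miss_all_tokens = miss_one_token ** num_tokens
    return num_experts * (1.0 - miss_all_tokens)


if __name__ == "__main__":
    # Hypothetical layer with 64 experts and top-8 routing.
    for n in (1, 4, 16, 64):
        print(n, round(expected_active_experts(n, 64, 8), 1))
```

Under this assumption the union grows quickly for the first few tokens and then saturates at `num_experts`, so each added draft token raises the number of experts that must run until nearly all of them are active.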
Abstract
Tree-based speculative decoding accelerates autoregressive generation by verifying multiple draft candidates in parallel, but this advantage weakens for sparse Mixture-of-Experts (MoE) models. As the draft tree grows, different branches activate different experts, expanding the union of activated experts and substantially increasing target-side verification cost. We propose EVICT, a training-free, hyperparameter-free, and lossless adaptive verification method for MoE speculative decoding. EVICT makes every verified token count by truncating the draft tree before target verification and retaining only the cost-effective prefix. It leverages fine-grained drafter signals to estimate candidate benefit, combines them with offline-profiled verification cost, and remains highly compatible with the high-performance graph-based serving framework SGLang. Extensive experiments on diverse MoE…
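The truncation idea in the abstract can be sketched as a small selection routine. This is a minimal illustration under stated assumptions, not the authors' algorithm: it assumes candidates are already ranked by a drafter-estimated acceptance probability (`path_probs`, sorted in descending order so that parents precede children), that `verify_cost[n]` is an offline-profiled cost of verifying `n` draft tokens, and that the objective is expected accepted tokens per unit of verification cost. All names and the objective itself are hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def choose_prefix_length(path_probs: Sequence[float],
                         verify_cost: Sequence[float]) -> int:
    """Pick how many ranked draft candidates to send to the target model.

    path_probs[i]  : drafter-estimated probability that candidate i is accepted
                     (assumed sorted in descending order).
    verify_cost[n] : profiled cost of one verification pass with n draft
                     tokens, for n = 0 .. len(path_probs).
    Returns the prefix length n that maximizes the assumed objective
    (1 + sum of the first n probabilities) / verify_cost[n].
    """
    if len(verify_cost) != len(path_probs) + 1:
        raise ValueError("verify_cost needs one entry per prefix length 0..N")
    # n = 0 means no draft tokens: the target still emits one token per pass.
    best_n, best_rate = 0, 1.0 / verify_cost[0]
    for n, gain in enumerate(accumulate(path_probs), start=1):
        rate = (1.0 + gain) / verify_cost[n]
        if rate > best_rate:
            best_n, best_rate = n, rate
    return best_n


if __name__ == "__main__":
    # Made-up numbers: cost jumps once extra tokens pull in new experts.
    probs = [0.9, 0.6, 0.3, 0.05]
    cost = [1.0, 1.1, 1.2, 1.6, 2.0]
    print(choose_prefix_length(probs, cost))
```

With these made-up inputs the routine keeps the first two candidates. The third and fourth add little expected benefit relative to the extra verification cost. Because the selection only decides which candidates are submitted, the target model's verification still determines what is accepted.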
