MoE-Spec: Expert Budgeting for Efficient Speculative Decoding

Bradley McDanel; Steven Li; Sruthikesh Surineni; Harshit Khaitan

arXiv:2602.16052·cs.LG·February 19, 2026

Open Access

TL;DR

MoE-Spec is a training-free expert budgeting technique for speculative decoding in MoE models. It caps how many experts are loaded per layer during verification, which reduces memory pressure and yields a reported 10-30% higher throughput than baselines, with no retraining.

Contribution

The paper proposes a verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, improving decoding efficiency without additional training.

Findings

1. Achieves 10-30% higher throughput than baselines

2. Maintains comparable quality with improved efficiency

3. Enables flexible trade-offs between accuracy and latency

Abstract

Speculative decoding accelerates Large Language Model (LLM) inference by verifying multiple drafted tokens in parallel. However, for Mixture-of-Experts (MoE) models, this parallelism introduces a severe bottleneck: large draft trees activate many unique experts, significantly increasing memory pressure and diminishing speedups from speculative decoding relative to autoregressive decoding. Prior methods reduce speculation depth when MoE verification becomes expensive. We propose MoE-Spec, a training-free verification-time expert budgeting method that decouples speculation depth from memory cost by enforcing a fixed expert capacity limit at each layer, loading only the experts that contribute most to verification and dropping the long tail of rarely used experts that drive bandwidth overhead. Experiments across multiple model scales and datasets show that this method yields 10-30%…
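
The per-layer budgeting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how an expert's contribution is scored or how tokens that lose an expert are handled. The sketch assumes that contribution is the summed router weight across all draft tokens, and that each token's surviving weights are renormalized. The function names `top_k_routing` and `budget_experts`, and the example router probabilities, are hypothetical.

```python
from collections import defaultdict


def top_k_routing(router_probs, k):
    """Standard per-token top-k routing.

    router_probs: one list of expert probabilities per draft token.
    Returns one {expert_id: weight} dict per token.
    """
    routed = []
    for probs in router_probs:
        top = sorted(range(len(probs)), key=lambda e: probs[e], reverse=True)[:k]
        routed.append({e: probs[e] for e in top})
    return routed


def budget_experts(routed, budget):
    """Keep at most `budget` experts for one layer.

    ASSUMPTION: experts are ranked by summed routing weight over all
    draft tokens; the paper's actual scoring rule may differ.
    """
    score = defaultdict(float)
    for assignment in routed:
        for expert, weight in assignment.items():
            score[expert] += weight

    # Highest aggregate weight first; ties broken by expert id.
    kept = set(sorted(score, key=lambda e: (-score[e], e))[:budget])

    pruned = []
    for assignment in routed:
        surviving = {e: w for e, w in assignment.items() if e in kept}
        total = sum(surviving.values())
        # ASSUMPTION: renormalize surviving weights; a token whose experts
        # were all dropped gets an empty assignment in this sketch.
        pruned.append({e: w / total for e, w in surviving.items()} if total > 0 else {})
    return kept, pruned


# Hypothetical router outputs: 4 draft tokens, 6 experts, top-2 routing.
router_probs = [
    [0.50, 0.30, 0.10, 0.05, 0.03, 0.02],
    [0.45, 0.05, 0.35, 0.05, 0.05, 0.05],
    [0.40, 0.35, 0.05, 0.10, 0.05, 0.05],
    [0.05, 0.40, 0.05, 0.05, 0.15, 0.30],
]

routed = top_k_routing(router_probs, k=2)
unique_before = {e for assignment in routed for e in assignment}
kept, pruned = budget_experts(routed, budget=2)

print("unique experts without budget:", sorted(unique_before))
print("experts loaded with budget=2:", sorted(kept))
```

In this toy case the four draft tokens touch four distinct experts, while a budget of two loads only the two with the largest aggregate weight. The number of experts loaded per layer is then bounded by `budget` regardless of how many draft tokens are verified, which is the decoupling the abstract describes.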

Taxonomy

Topics: Computational and Text Analysis Methods · Topic Modeling · Generative Adversarial Networks and Image Synthesis