Utility-Driven Speculative Decoding for Mixture-of-Experts
Anish Saxena, Po-An Tsai, Hritvik Taneja, Aamer Jaleel, Moinuddin Qureshi

TL;DR
This paper introduces Cascade, a dynamic framework that selectively enables speculative decoding in Mixture-of-Experts models, significantly improving throughput and avoiding slowdowns caused by naive speculation methods.
Contribution
Cascade is a utility-driven, dynamic approach that tunes speculation parameters in MoE models, making speculative decoding practical and efficient.
Findings
Cascade limits slowdown to 5%, down from up to 1.5x with naive speculation.
It improves throughput by 7-14% over a static choice of K.
Cascade effectively adapts to different tasks and models.
Abstract
GPU memory bandwidth is the main bottleneck for low-latency Large Language Model (LLM) inference. Speculative decoding leverages idle GPU compute by using a lightweight drafter to propose K tokens, which the LLM verifies in parallel, boosting token throughput. In conventional dense LLMs, all model weights are fetched each iteration, so speculation adds no latency overhead. Emerging Mixture of Experts (MoE) models activate only a subset of weights per token, greatly reducing data movement. However, we show that speculation is ineffective for MoEs: draft tokens collectively activate more weights, increasing data movement and verification time by 2-3x. When token throughput gains fail to offset this overhead, speculation causes slowdowns up to 1.5x, making it infeasible. Even when useful, the optimal K varies by task, model, and even between requests and iterations. Thus, despite…
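The trade-off the abstract describes can be illustrated with a toy cost model. This is a minimal sketch and not the paper's method: it assumes each draft token is accepted independently with a fixed probability, ignores drafter cost, and takes the verification-step time as a given input relative to a non-speculative step. The function names (`expected_tokens`, `speedup`, `best_k`) and all numeric inputs are hypothetical.

```python
def expected_tokens(accept_prob: float, k: int) -> float:
    """Expected tokens produced per verification step with k draft tokens.

    Assumption: each draft token is accepted independently with probability
    accept_prob, and the verifier always contributes one token of its own.
    """
    if accept_prob >= 1.0:
        return float(k + 1)
    # Geometric series: 1 + p + p^2 + ... + p^k
    return (1.0 - accept_prob ** (k + 1)) / (1.0 - accept_prob)


def speedup(accept_prob: float, k: int, relative_step_time: float) -> float:
    """Token gain divided by verification cost.

    relative_step_time is the speculative step time divided by the
    non-speculative step time (an assumed input, not a measured value).
    A result below 1.0 means speculation is a net slowdown.
    """
    return expected_tokens(accept_prob, k) / relative_step_time


def best_k(accept_prob: float, step_time_by_k: dict[int, float]) -> int:
    """Pick the K with the highest speedup; K = 0 means speculation is off."""
    candidates = {0: 1.0, **step_time_by_k}
    return max(candidates, key=lambda k: speedup(accept_prob, k, candidates[k]))


if __name__ == "__main__":
    # Dense-like case: verification adds no latency, so speculation helps.
    print(speedup(accept_prob=0.5, k=3, relative_step_time=1.0))
    # MoE-like case: verification is 2.5x slower, so the same drafts hurt.
    print(speedup(accept_prob=0.5, k=3, relative_step_time=2.5))
    # With a low acceptance rate and costly verification, K = 0 wins.
    print(best_k(0.5, {1: 2.0, 3: 2.5}))
```

Under these assumptions, the same K and acceptance rate give a speedup when verification is free and a slowdown when verification costs 2-3x more, which is why a single static K cannot be right for every task, model, and request.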
Taxonomy
Topics: Machine Learning and Algorithms · Target Tracking and Data Fusion in Sensor Networks · Distributed Sensor Networks and Detection Algorithms
Methods: Sparse Evolutionary Training · Mixture of Experts
