Utility-Driven Speculative Decoding for Mixture-of-Experts

Anish Saxena; Po-An Tsai; Hritvik Taneja; Aamer Jaleel; Moinuddin Qureshi

arXiv:2506.20675·cs.DC·June 27, 2025

Open Access

TL;DR

This paper introduces Cascade, a dynamic framework that selectively enables speculative decoding in Mixture-of-Experts (MoE) models. It improves throughput by 7-14% over a static K and limits the slowdown that naive speculation can cause (up to 1.5x) to 5%.

Contribution

Cascade is a utility-driven, dynamic approach that tunes the speculation length K in MoE models and disables speculation when it does not pay off, making speculative decoding practical for these models.

Findings

01

Cascade limits speculation-induced slowdown to 5%, down from as much as 1.5x.

02

Cascade improves throughput by 7-14% over a static K.

03

Cascade effectively adapts to different tasks and models.

Abstract

GPU memory bandwidth is the main bottleneck for low-latency Large Language Model (LLM) inference. Speculative decoding leverages idle GPU compute by using a lightweight drafter to propose K tokens, which the LLM verifies in parallel, boosting token throughput. In conventional dense LLMs, all model weights are fetched each iteration, so speculation adds no latency overhead. Emerging Mixture of Experts (MoE) models activate only a subset of weights per token, greatly reducing data movement. However, we show that speculation is ineffective for MoEs: draft tokens collectively activate more weights, increasing data movement and verification time by 2-3x. When token throughput gains fail to offset this overhead, speculation causes slowdowns up to 1.5x, making it infeasible. Even when useful, the optimal K varies by task, model, and even between requests and iterations. Thus, despite…
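The trade-off described in the abstract can be illustrated with a small back-of-the-envelope model. This is a minimal sketch, not the paper's utility metric, which is not defined in this excerpt. It assumes each draft token is accepted independently with a fixed probability (the standard simplification in the speculative decoding literature) and that verification cost is given as a multiple of a plain decode step. All function and parameter names are hypothetical.

```python
def expected_tokens_per_iteration(accept_rate: float, k: int) -> float:
    """Expected tokens produced by one verify step with k draft tokens.

    Assumption: each draft token is accepted independently with
    probability accept_rate, and the verifier always contributes one
    token of its own, giving (1 - a^(k+1)) / (1 - a).
    """
    if k == 0:
        return 1.0
    if accept_rate >= 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


def speculation_speedup(accept_rate: float, k: int,
                        verify_cost_ratio: float,
                        draft_cost_ratio: float = 0.0) -> float:
    """Estimated speedup over non-speculative decoding.

    verify_cost_ratio: time to verify the draft divided by the time of
        one plain decode step (about 1.0 for a dense model; the abstract
        reports 2-3x for MoEs).
    draft_cost_ratio: drafter time per draft token, in the same units.
    """
    tokens = expected_tokens_per_iteration(accept_rate, k)
    iteration_time = verify_cost_ratio + k * draft_cost_ratio
    return tokens / iteration_time


def pick_k(accept_rate: float, verify_cost_by_k: dict,
           draft_cost_ratio: float = 0.0) -> tuple:
    """Pick the K with the best estimated speedup.

    Falls back to K = 0 (speculation off, speedup 1.0) when no
    candidate K is estimated to beat plain decoding.
    """
    best_k, best_speedup = 0, 1.0
    for k, cost in verify_cost_by_k.items():
        speedup = speculation_speedup(accept_rate, k, cost, draft_cost_ratio)
        if speedup > best_speedup:
            best_k, best_speedup = k, speedup
    return best_k, best_speedup
```

Under these assumptions, an acceptance rate of 0.4 with K = 3 and a 2.5x verification cost gives an estimated speedup below 1, so `pick_k` turns speculation off. The same K with a verification cost of 1.0, as in the dense case, is estimated to be a clear win.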

Taxonomy

Topics: Machine Learning and Algorithms · Target Tracking and Data Fusion in Sensor Networks · Distributed Sensor Networks and Detection Algorithms

Methods: Sparse Evolutionary Training · Mixture of Experts