MoESD: Unveil Speculative Decoding's Potential for Accelerating Sparse MoE

Zongle Huang; Lei Zhu; Zongyuan Zhan; Ting Hu; Weikai Mao; Xianzhi Yu; Yongpan Liu; Tianyu Zhang

arXiv:2505.19645 · cs.LG · February 17, 2026

TL;DR

This paper shows that speculative decoding (SD) can significantly accelerate sparse Mixture of Experts (MoE) language models, especially at medium batch sizes, and backs the claim with both theoretical analysis and measured speedups.

Contribution

The paper demonstrates that speculative decoding benefits MoE models more than dense models at medium batch sizes, and it introduces a new metric, 'target efficiency', to better understand SD performance.

Findings

01. Up to 2.29x speedup on GPUs for MoE models.

02. Speculative decoding benefits increase as MoE models become sparser.

03. Theoretical analysis aligns with experimental results.

Abstract

Large Language Models (LLMs) have achieved remarkable success across many applications, with Mixture of Experts (MoE) models demonstrating great potential. Compared to traditional dense models, MoEs achieve better performance with less computation. Speculative decoding (SD) is a widely used technique to accelerate LLM inference without accuracy loss, but it has been considered efficient only for dense models. In this work, we first demonstrate that, under medium batch sizes, MoE surprisingly benefits more from SD than dense models. Furthermore, as MoE becomes sparser -- the prevailing trend in MoE designs -- the batch size range where SD acceleration is expected to be effective becomes broader. To quantitatively understand tradeoffs involved in SD, we develop a reliable modeling based on theoretical analyses. While current SD research primarily focuses on improving acceptance rates of…
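
One way to build intuition for the batch-size claim is to count how many experts a decoding step touches. The sketch below is an illustration, not the paper's model: it assumes each token is routed to `top_k` of `num_experts` experts uniformly and independently, and that SD verifies `draft_len + 1` tokens per sequence in one target forward pass. All function and parameter names are hypothetical.

```python
def expected_active_experts(num_experts: int, top_k: int, num_tokens: int) -> float:
    """Expected number of distinct experts hit by `num_tokens` tokens.

    Assumption (not from the paper): uniform, independent top-k routing,
    so a given expert is missed by one token with probability 1 - top_k/num_experts.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be in 1..num_experts")
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    miss_prob = 1.0 - top_k / num_experts
    return num_experts * (1.0 - miss_prob ** num_tokens)


def expert_load_ratio(num_experts: int, top_k: int,
                      batch_size: int, draft_len: int) -> float:
    """Experts touched by an SD verification step relative to a plain decode step.

    A plain step processes `batch_size` tokens; a verification step is
    assumed to process `batch_size * (draft_len + 1)` tokens.
    A ratio near 1 means verification loads almost no extra expert weights.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    plain = expected_active_experts(num_experts, top_k, batch_size)
    verify = expected_active_experts(num_experts, top_k,
                                     batch_size * (draft_len + 1))
    return verify / plain


if __name__ == "__main__":
    # Hypothetical configuration: 8 experts, top-2 routing, 3 drafted tokens.
    for batch_size in (1, 4, 16, 64):
        ratio = expert_load_ratio(num_experts=8, top_k=2,
                                  batch_size=batch_size, draft_len=3)
        print(f"batch={batch_size:>3}  expert-load ratio={ratio:.3f}")
```

Under these assumptions the ratio falls toward 1 as the batch grows: once a plain decode step already touches nearly every expert, verifying extra drafted tokens adds little weight-loading cost. This captures only the memory-traffic side of the tradeoff and ignores compute, draft-model cost, and acceptance rate.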

Taxonomy

Topics: Data Quality and Management · Cognitive Computing and Networks · Graph Theory and Algorithms

Methods: Mixture of Experts · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings