Scaling Laws for Speculative Decoding

Siyuan Yan; Mo Zhu; Guo-qing Jiang; Jianfei Wang; Jiaxing Chen; Wentai Zhang; Xiang Liao; Xiao Cui; Chen Zhang; Zhuoran Song; Ran Zhu

arXiv:2505.07858·cs.CL·May 14, 2025

TL;DR

This paper establishes scaling laws for speculative decoding in large language models. By jointly scaling draft model capacity, batch size, and training data, it makes inference more efficient and raises decoding throughput.

Contribution

The paper introduces Log-linear Scaling Laws for speculative decoding acceptance rates and develops Scylla, a multi-dimensional scaling method for popular LLMs, validated through empirical experiments.

Findings

01. Scylla achieves 1.5-2.2x higher acceptance rate than EAGLE2.

02. Decoding throughput is doubled compared to EAGLE2 in industrial settings.

03. Scaling laws accurately predict decoding efficiency across different model sizes.
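
The findings above link acceptance rate to decoding throughput. As a rough illustration of why the two move together, the sketch below uses the standard simplification from the speculative sampling literature (an assumption here, not a result from this paper): a single chain draft of `gamma` tokens in which each token is accepted independently with probability `alpha`. Under that assumption the expected number of tokens produced per target-model verification step is (1 - alpha^(gamma+1)) / (1 - alpha). The function name and the sample values of `alpha` and `gamma` are hypothetical, and tree-based drafting as used by EAGLE2 does not follow this formula exactly.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes one chain draft of `gamma` tokens, each accepted independently
    with probability `alpha` (an i.i.d. simplification, not the paper's model).
    The count includes the one token the target model always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every drafted token is accepted, plus one bonus token from the target.
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates, chosen only to show the trend.
for alpha in (0.4, 0.6, 0.8):
    tokens = expected_tokens_per_step(alpha, gamma=4)
    print(f"alpha={alpha:.1f} -> {tokens:.3f} tokens per verification step")
```

Because the expression grows faster than linearly in `alpha` for a fixed `gamma`, a given gain in acceptance rate matters more when the acceptance rate is already high. Real throughput also depends on draft-model cost and batch size, which this sketch ignores.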

Abstract

The escalating demand for efficient decoding in large language models (LLMs) is particularly critical for reasoning-intensive architectures like OpenAI-o3 and DeepSeek-R1, which depend on extended chain-of-thought reasoning. This study investigates speculative decoding techniques through dense LLM architectures to establish foundational insights for accelerating reasoning tasks. While speculative decoding methods leveraging parallel draft-verification cycles have emerged as promising acceleration techniques, the scaling laws governing decoding efficiency remain under-explored compared to conventional backbone LLMs developed through Pretraining → SFT → RLHF training paradigms. In this work, we discover Log-linear Scaling Laws (Theorems 1.1, 1.2, and 1.3) governing draft model acceptance rate (or decoding speed) across three dimensions: pretraining token volume, draft model capacity, and…
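
A log-linear law means that the measured quantity grows linearly in the logarithm of the scaled resource. The sketch below fits the generic form y = a * ln(x) + b by ordinary least squares. This generic form is an assumption: the exact parameterization of the paper's theorems is not reproduced in this summary. The data points are synthetic, generated from made-up coefficients so the fit can be checked, and are not results from the paper. The function names are hypothetical.

```python
import math


def fit_log_linear(xs, ys):
    """Least-squares fit of y = a * ln(x) + b; returns (a, b)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    logs = [math.log(x) for x in xs]
    n = len(logs)
    mean_log = sum(logs) / n
    mean_y = sum(ys) / n
    sxx = sum((l - mean_log) ** 2 for l in logs)
    if sxx == 0.0:
        raise ValueError("x values must not all be identical")
    sxy = sum((l - mean_log) * (y - mean_y) for l, y in zip(logs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_log
    return a, b


def predict(a, b, x):
    """Evaluate the fitted law at a new resource level x."""
    return a * math.log(x) + b


# Synthetic example: token budgets and acceptance rates generated from
# made-up coefficients (a=0.05, b=-0.5). NOT measurements from the paper.
true_a, true_b = 0.05, -0.5
token_budgets = [1e9, 3e9, 1e10, 3e10, 1e11]
acceptance = [true_a * math.log(t) + true_b for t in token_budgets]

a, b = fit_log_linear(token_budgets, acceptance)
print(f"fitted a={a:.4f}, b={b:.4f}")
print(f"predicted value at 1e12 tokens: {predict(a, b, 1e12):.4f}")
```

With real measurements the points would be noisy, and an acceptance rate cannot exceed 1, so a fit of this kind is only meaningful inside the range where the log-linear trend has been observed.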
