Polybasic Speculative Decoding Through a Theoretical Perspective

Ruilin Wang; Huixia Li; Yuexiao Ma; Xiawu Zheng; Fei Chao; Xuefeng Xiao; Rongrong Ji

arXiv:2510.26527·cs.LG·October 31, 2025

TL;DR

This paper introduces a theoretically grounded polybasic (multi-model) speculative decoding framework that accelerates large language model inference by up to 4.43x while maintaining output quality.

Contribution

It presents a novel multi-model speculative decoding approach, backed by rigorous theoretical analysis and a practical implementation, that extends and outperforms traditional dualistic (draft-verify) methods.

Findings

01

Achieves up to 4.43x speedup on various LLMs

02

Provides theoretical characterization of optimal inference time

03

Supports integration with existing speculative techniques
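For intuition about the inference-time trade-off in finding 02, here is a minimal sketch of the well-known cost model for the dualistic baseline this paper generalizes (Leviathan et al., 2023). It is not the paper's polybasic theorem. It assumes each drafted token is accepted independently with rate alpha, that gamma tokens are drafted per target call, and that one draft step costs c target steps. Under these assumptions the expected tokens per target call are (1 - alpha^(gamma+1)) / (1 - alpha), and the walltime speedup is that quantity divided by (gamma*c + 1). The function names are hypothetical.

```python
def expected_accepted_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model call in dualistic speculative
    decoding, assuming i.i.d. per-token acceptance rate alpha (an assumption)."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Walltime improvement over plain autoregressive decoding, where c is the
    cost of one draft-model step relative to one target-model step."""
    return expected_accepted_tokens(alpha, gamma) / (gamma * c + 1.0)


def best_gamma(alpha: float, c: float, max_gamma: int = 32) -> int:
    """Draft length in [1, max_gamma] that maximizes the expected speedup
    under this simplified cost model (hypothetical helper)."""
    return max(range(1, max_gamma + 1), key=lambda g: expected_speedup(alpha, g, c))
```

With the illustrative inputs alpha = 0.8 and c = 0.05, the model gives about 3.36 tokens per target call at gamma = 4, and the expected speedup peaks at gamma = 8. These numbers are made-up inputs to the baseline formula, not results reported in the paper.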

Abstract

Inference latency stands as a critical bottleneck in the large-scale deployment of Large Language Models (LLMs). Speculative decoding methods have recently shown promise in accelerating inference without compromising the output distribution. However, existing work typically relies on a dualistic draft-verify framework and lacks rigorous theoretical grounding. In this paper, we introduce a novel polybasic speculative decoding framework, underpinned by a comprehensive theoretical analysis. Specifically, we prove a fundamental theorem that characterizes the optimal inference time for multi-model speculative decoding systems, shedding light on how to extend beyond the dualistic approach to a more general polybasic paradigm. Through our theoretical investigation of multi-model token generation, we expose and optimize the interplay between model capabilities, acceptance lengths, and…
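The claim that speculative decoding accelerates inference "without compromising the output distribution" rests on the standard acceptance rule of dualistic speculative sampling (Leviathan et al., 2023; Chen et al., 2023). The following is a minimal sketch of that rule, not the paper's polybasic verification procedure. A drafted token x is accepted with probability min(1, p[x] / q[x]), where p is the target distribution and q is the draft distribution. On rejection, a replacement is resampled from the normalized residual max(0, p - q). If every draft survives, one bonus token is sampled from the target. Distributions are plain Python lists here, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def verify_draft_token(p: Sequence[float], q: Sequence[float], x: int,
                       rng: random.Random) -> Tuple[bool, int]:
    """One verification step: p is the target distribution, q the draft
    distribution, and x a token that was sampled from q."""
    # Accept the drafted token with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return True, x
    # On rejection, resample from the residual max(0, p - q); random.choices
    # normalizes the weights, and they are nonzero whenever rejection is possible.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return False, rng.choices(range(len(p)), weights=residual, k=1)[0]


def verify_block(p_rows: Sequence[Sequence[float]], q_rows: Sequence[Sequence[float]],
                 draft: Sequence[int], rng: random.Random) -> List[int]:
    """Verify a block of drafted tokens left to right. p_rows needs
    len(draft) + 1 rows: one per drafted position plus one bonus position."""
    out: List[int] = []
    for i, x in enumerate(draft):
        accepted, tok = verify_draft_token(p_rows[i], q_rows[i], x, rng)
        out.append(tok)
        if not accepted:
            return out  # stop at the first rejection; later drafts are discarded
    # Every draft was accepted: sample one bonus token from the target model.
    bonus = p_rows[len(draft)]
    out.append(rng.choices(range(len(bonus)), weights=bonus, k=1)[0])
    return out
```

The emitted token is distributed exactly as p, whatever q is. For example, with p = [0.7, 0.3] and q = [0.2, 0.8], token 0 appears with probability 0.2 + 0.8 * (1 - 0.3 / 0.8) = 0.7. A multi-model (polybasic) system chains more than one such draft/verify relationship. The paper's theory governs how those stages should be arranged, and this sketch does not attempt to reproduce it.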
