Make Every Draft Count: Hidden State based Speculative Decoding

Yuetao Chen; Xuliang Wang; Xinzhou Zheng; Ming Li; Peng Wang; Hong Xu

arXiv:2602.21224·cs.CL·February 26, 2026

TL;DR

This paper introduces a speculative decoding system that reuses the hidden states of discarded drafts to make large language model inference more efficient, achieving up to 3.3x speedup over standard speculative decoding.

Contribution

The paper proposes a hidden-state-based speculative decoding method that transforms discarded drafts into reusable tokens, improving efficiency over traditional token-based approaches.

Findings

01. Achieves up to 3.3x speedup over standard speculative decoding.

02. Introduces a draft model architecture based on auto-regressive hidden states.

03. Designs an efficient token injection mechanism for high-quality draft reuse.

Abstract

Speculative decoding has emerged as a pivotal technique for accelerating LLM inference: a lightweight draft model generates candidate tokens, which the target model then verifies in parallel. However, while this paradigm successfully increases the arithmetic intensity of memory-bound inference, it causes significant compute inefficiency: the majority of draft tokens fail verification and are discarded, wasting the computation spent on them. Motivated by the goal of reclaiming this wasted computation, we propose a novel system that transforms discarded drafts into reusable tokens. Our key insight is to perform auto-regressive prediction at the hidden-state level and to postpone integrating token information until after the hidden states are generated, so that the draft hidden states are not contaminated by incorrect tokens, which enables hidden-state reuse. To implement such a…
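To make the draft-then-verify loop and its waste concrete, here is a minimal sketch of one round of standard token-based speculative decoding with greedy acceptance. It illustrates the baseline the abstract describes, not the paper's hidden-state method. Everything named here (`speculative_step`, `toy_target`, `toy_draft`, the integer "tokens") is a hypothetical stand-in, and the verification is written as a loop even though a real target model would score all positions in one parallel pass.

```python
from typing import Callable, List, Tuple

# Hypothetical toy "model": maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(
    prefix: List[int], draft: NextToken, target: NextToken, k: int
) -> Tuple[List[int], int]:
    """Run one draft-then-verify round.

    Returns (tokens_emitted, drafts_discarded).
    """
    # 1) The cheap draft model proposes k tokens auto-regressively.
    ctx = list(prefix)
    candidates: List[int] = []
    for _ in range(k):
        tok = draft(ctx)
        candidates.append(tok)
        ctx.append(tok)

    # 2) The target model checks each candidate against its own prediction.
    #    (Sequential here for clarity; done in parallel in a real system.)
    ctx = list(prefix)
    emitted: List[int] = []
    for i, tok in enumerate(candidates):
        expected = target(ctx)
        if tok != expected:
            # First mismatch: emit the target's token and discard this
            # candidate plus every candidate drafted after it.
            emitted.append(expected)
            return emitted, k - i
        emitted.append(tok)
        ctx.append(tok)

    # All k candidates accepted: the target contributes one extra token.
    emitted.append(target(ctx))
    return emitted, 0


def toy_target(ctx: List[int]) -> int:
    # Assumed ground truth: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Assumed imperfect draft: agrees with the target except after token 1.
    return 5 if ctx[-1] == 1 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    emitted, discarded = speculative_step([0], toy_draft, toy_target, k=4)
    print(emitted, discarded)
```

In this toy run the draft proposes four tokens, the second one is wrong, and three of the four drafted tokens are thrown away, because each later token was conditioned on the incorrect one. That discarded work is the inefficiency the abstract targets.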

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Generative Adversarial Networks and Image Synthesis · Natural Language Processing Techniques