When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

Tianyu Liu; Yuhao Shen; Xinyi Hu; Baolin Zhang; Hengxin Zhang; Jun Dai; Jun Zhang; Shuang Ge; Lei Chen; Yue Li; MingCheng Wan

arXiv:2604.26412 · cs.CL · May 12, 2026

TL;DR

This paper investigates whether reusing the target model's key-value (KV) cache can improve long-range speculative decoding in large language models. It identifies structural bottlenecks and proposes a diagnostic framework for studying them.

Contribution

It introduces KVShot, a diagnostic framework comparing different cache reuse paradigms, and identifies key bottlenecks for enhancing long-range decoding.

Findings

01. KV-Reuse improves long-range acceptance in decoding.

02. Shallow drafters struggle with query estimation accuracy.

03. Sparse gradient signals hinder KV projection training.

Abstract

Speculative decoding accelerates LLM inference, but SOTA hidden-state-based drafters suffer from long-range decay: draft accuracy degrades as the speculative step increases. Existing work attributes this decay to train-inference mismatch and proposes test-time training (TTT) as a remedy, yet we observe that long-range decay persists even in TTT-trained drafters. We revisit long-range decay from the perspective of context information preservation. In hidden-state reuse, we argue the target hidden state acts as a biased context compression: it aggregates historical token information according to the attention query at the current position, yielding a compact representation optimized for immediate next-token prediction. This compression can suppress information less relevant to the current query but important for later speculative steps. In contrast, the target model's KV cache serves as…
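
The "biased context compression" argument can be illustrated with a toy example. The following is a minimal sketch, assuming single-head scaled dot-product attention over random, unit-norm keys; it is not the paper's KVShot framework, and the names `attend`, `q_now`, and `q_later` are hypothetical. It shows that the attention output for the current query is one weighted mix of the value vectors, whereas keeping the keys and values lets a different later query re-weight the same context.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector.

    Returns the attention weights over context tokens and the
    weighted sum of values (the compressed context summary).
    """
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights, weights @ values


rng = np.random.default_rng(0)
n_tokens, d = 6, 8  # toy sizes, chosen only for illustration

# Toy KV cache: one key and one value vector per context token.
keys = rng.normal(size=(n_tokens, d))
keys = keys / np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
values = rng.normal(size=(n_tokens, d))

# Assumption: the current query is aligned with token 1, and a
# hypothetical later speculative step would want token 4 instead.
q_now = 4.0 * keys[1]
q_later = 4.0 * keys[4]

# Hidden-state-style summary: values mixed according to q_now only.
w_now, h_now = attend(q_now, keys, values)

# KV-cache-style reuse: the same keys/values re-attended with q_later.
w_later, h_later = attend(q_later, keys, values)

print("weights under current query:", np.round(w_now, 3))
print("weights under later query:  ", np.round(w_later, 3))
```

In this sketch, `h_now` weights token 1 most heavily, so information carried mainly by token 4 is down-weighted in it. Because `keys` and `values` are retained, the later query can still place its largest weight on token 4.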
