SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification

Zhendong Tan; Xingjun Zhang; Chaoyi Hu; Junjie Peng; Kun Xia

arXiv:2512.02337·cs.LG·December 3, 2025

TL;DR

SpecPV accelerates long-context generation in large language models by verifying draft tokens against partial key-value (KV) states and periodically running full verification, reaching up to 6x decoding speedup with minor degradation in output quality.

Contribution

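The paper introduces SpecPV, a self-speculative decoding method that targets the verification bottleneck in long-context generation: most verification steps use only partial KV states, and periodic full verification eliminates accumulated errors.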

Findings

01. Achieves up to 6x decoding speedup

02. Maintains high output quality with minor degradation

03. Effective across multiple models and benchmarks

Abstract

Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct…
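The draft-verify loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the integer "models" (`target_next`, `partial_next`, `draft_next`), the sink-plus-recent selection in `partial_view`, and the `full_every` schedule are all hypothetical stand-ins chosen for illustration. The sketch assumes greedy decoding and assumes that a full verification step rechecks the tokens accepted under partial verification since the last full check, which is one plausible reading of "periodically applies full verification to eliminate accumulated errors".

```python
from typing import Callable, List

Token = int


def target_next(ctx: List[Token]) -> Token:
    """Toy stand-in for the target model's greedy next token (needs len(ctx) >= 2)."""
    tok = 2 * ctx[-1] + ctx[-2] + ctx[0]
    if len(ctx) % 4 == 0:
        tok += ctx[len(ctx) // 2]  # occasional long-range dependency
    return tok % 5


def partial_view(ctx: List[Token], sink: int = 2, recent: int = 4) -> List[Token]:
    """Hypothetical partial-KV selection: keep the first `sink` and last `recent` tokens."""
    if len(ctx) <= sink + recent:
        return list(ctx)
    return ctx[:sink] + ctx[-recent:]


def partial_next(ctx: List[Token]) -> Token:
    """Target model evaluated on the partial context only (cheap, may be wrong)."""
    return target_next(partial_view(ctx))


def draft_next(ctx: List[Token]) -> Token:
    """Assumed self-speculative drafter: the same toy model on an even smaller view."""
    return target_next(partial_view(ctx, sink=1, recent=2))


def verify(ctx: List[Token], draft: List[Token],
           next_fn: Callable[[List[Token]], Token]) -> List[Token]:
    """Greedy verification: keep the longest draft prefix the verifier agrees with,
    then append the verifier's own token. A real system would score all draft
    positions in one forward pass; the loop here is only for clarity."""
    accepted: List[Token] = []
    for tok in draft:
        expected = next_fn(ctx + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token ends this round
            return accepted
        accepted.append(tok)
    accepted.append(next_fn(ctx + accepted))  # bonus token when all drafts pass
    return accepted


def generate(prompt: List[Token], n_new: int, k: int = 3, full_every: int = 4) -> List[Token]:
    """Partial verification on most steps; every `full_every`-th step rechecks the
    tentative tokens against the full context and discards any divergent suffix."""
    out = list(prompt)
    confirmed = len(out)  # tokens before this index passed full verification
    step = 0
    while confirmed - len(prompt) < n_new:
        step += 1
        if step % full_every == 0:
            # Full verification: treat the tentative tokens as a draft.
            out = out[:confirmed] + verify(out[:confirmed], out[confirmed:], target_next)
            confirmed = len(out)
        else:
            draft: List[Token] = []
            for _ in range(k):
                draft.append(draft_next(out + draft))
            out += verify(out, draft, partial_next)
    return out[:len(prompt) + n_new]


def greedy(prompt: List[Token], n_new: int) -> List[Token]:
    """Reference: plain greedy decoding with the full-context target model."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out
```

In this sketch, `generate(prompt, n)` returns the same tokens as `greedy(prompt, n)` because only tokens that pass a full check are ever returned. The abstract reports minor quality degradation for SpecPV, so the actual method should not be assumed to be lossless in this way.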

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification