SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification
Zhendong Tan, Xingjun Zhang, Chaoyi Hu, Junjie Peng, Kun Xia

TL;DR
SpecPV speeds up long-context generation in large language models by verifying draft tokens against partial key-value (KV) states, with periodic full verification to maintain output quality.
Contribution
The paper introduces SpecPV, a self-speculative decoding method that uses partial verification to relieve the verification bottleneck in long-context generation.
Findings
Achieves up to 6x decoding speedup
Maintains high output quality with minor degradation
Effective across multiple models and benchmarks
Abstract
Growing demands from tasks like code generation, deep reasoning, and long-document understanding have made long-context generation a crucial capability for large language models (LLMs). Speculative decoding is one of the most direct and effective approaches for accelerating generation. It follows a draft-verify paradigm, where a lightweight draft model proposes several candidate tokens and the target model verifies them. However, we find that as the context length grows, verification becomes the dominant bottleneck. To further accelerate speculative decoding in long-context generation, we introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states (KV) and periodically applies full verification to eliminate accumulated errors. We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct…
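The draft-verify loop with partial and periodic full verification described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the "models" are plain Python functions over integer tokens, the partial KV state is imitated by showing the target only the first `sink` and last `window` tokens, and every name and parameter here (`partial_view`, `verify`, `generate`, `k`, `full_every`, `sink`, `window`) is hypothetical. The rollback-to-first-mismatch rule used at full verification is also an assumption, since the abstract does not say how SpecPV handles accumulated errors. In self-speculative decoding the draft shares the target's weights; the sketch uses a separate draft function only for simplicity.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a token context to the next token (greedy)


def partial_view(context: List[int], sink: int, window: int) -> List[int]:
    """Toy stand-in for partial KV states: keep the first `sink` and last `window` tokens."""
    if len(context) <= sink + window:
        return list(context)
    return context[:sink] + context[-window:]


def verify(target: Model, context: List[int], proposal: List[int]) -> List[int]:
    """Accept the longest prefix of `proposal` that `target` agrees with.

    On the first disagreement, the target's own token replaces the rejected one.
    A real system scores all proposed tokens in one forward pass; the loop is for clarity.
    """
    ctx, accepted = list(context), []
    for tok in proposal:
        expected = target(ctx)
        accepted.append(expected)
        if tok != expected:
            break
        ctx.append(tok)
    return accepted


def generate(target: Model, draft: Model, prompt: List[int], max_new: int,
             k: int = 3, full_every: int = 2, sink: int = 2, window: int = 4) -> List[int]:
    out = list(prompt)
    checked = len(prompt)          # out[:checked] has passed full verification
    goal = len(prompt) + max_new
    step = 0

    def partial_target(ctx: List[int]) -> int:
        return target(partial_view(ctx, sink, window))

    while checked < goal:
        step += 1
        # Draft k candidate tokens autoregressively.
        ctx, proposal = list(out), []
        for _ in range(k):
            proposal.append(draft(ctx))
            ctx.append(proposal[-1])
        # Fast path: verify against the partial context only.
        out += verify(partial_target, out, proposal)
        # Slow path (assumed rule): periodically re-check all unverified tokens
        # with the full context and roll back to the first mismatch.
        if step % full_every == 0 or len(out) >= goal:
            out = out[:checked] + verify(target, out[:checked], out[checked:])
            checked = len(out)
    return out[:goal]


def greedy(target: Model, prompt: List[int], max_new: int) -> List[int]:
    """Plain autoregressive decoding with the full context, used as a reference."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Toy models: the target depends on the whole context, the draft only on recent tokens.
def toy_target(ctx: List[int]) -> int:
    return (sum(ctx) + len(ctx)) % 7


def toy_draft(ctx: List[int]) -> int:
    return (sum(ctx[-3:]) + len(ctx)) % 7
```

Because this toy version rolls back every token that fails full verification, `generate(toy_target, toy_draft, prompt, n)` returns the same sequence as `greedy(toy_target, prompt, n)`. That exactness is a property of the sketch's rollback rule and should not be read as a claim about SpecPV itself.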
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
