Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation

Ziyin Zhang; Jiahao Xu; Tian Liang; Xingyu Chen; Zhiwei He; Rui Wang; Zhaopeng Tu

arXiv:2411.18462 · cs.CL · August 26, 2025

TL;DR

This paper introduces SVIP, a dynamic length policy for speculative decoding that adaptively determines draft lengths based on draft entropy, significantly improving speed in long-form generation tasks.

Contribution

The paper proposes a training-free, entropy-based dynamic length policy for speculative decoding, addressing limitations of fixed-length policies in complex, long-form generation scenarios.

Findings

1. Achieves up to 17% speedup on MT-Bench at 8K context.

2. Achieves up to 22% speedup on QwQ in long-form reasoning.

3. Demonstrates effectiveness on mainstream and reasoning-heavy benchmarks.

Abstract

Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model's prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context…
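
The entropy-based idea in the abstract can be illustrated with a minimal sketch: keep drafting while the draft model's next-token entropy is low, and hand the draft to the target model for verification once the entropy rises. The excerpt above does not give the paper's exact stopping criterion, so the plain threshold used here, along with the names `draft_until_uncertain`, `next_token_probs`, `entropy_threshold`, `max_draft_len`, and `toy_model`, are assumptions made for illustration only, not the authors' implementation.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_until_uncertain(
    next_token_probs: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    entropy_threshold: float = 1.0,  # hypothetical value, not from the paper
    max_draft_len: int = 16,         # hypothetical cap on the draft length
) -> List[int]:
    """Greedily draft tokens until the draft model becomes uncertain.

    `next_token_probs` stands in for the draft model: it maps a token
    context to a probability distribution over the vocabulary.
    """
    draft: List[int] = []
    context = list(prefix)
    while len(draft) < max_draft_len:
        probs = next_token_probs(context)
        # High entropy is taken as a sign the target model would likely
        # reject the next draft token, so drafting stops here.
        if entropy(probs) > entropy_threshold:
            break
        token = max(range(len(probs)), key=probs.__getitem__)
        draft.append(token)
        context.append(token)
    return draft


def toy_model(context: List[int]) -> Sequence[float]:
    """Toy draft model: confident for three steps, then uniform."""
    if len(context) < 3:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


print(draft_until_uncertain(toy_model, prefix=[]))  # [0, 0, 0]
```

In this toy run the draft stops after three tokens because the fourth distribution is uniform, with entropy ln 4 ≈ 1.39 nats, above the assumed threshold of 1.0. A fixed-length policy would instead keep drafting past that point regardless of how uncertain the draft model had become.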

Code & Models

Repositories

geralt-targaryen/svip (PyTorch, official)

Models

Geralt-Targaryen/QwQ-1.5B-Persona (Hugging Face model; 4 downloads, 2 likes)

Taxonomy

Topics: Formal Methods in Verification

Methods: Guided Language to Image Diffusion for Generation and Editing · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings