Draft Model Knows When to Stop: Self-Verification Speculative Decoding for Long-Form Generation
Ziyin Zhang, Jiahao Xu, Tian Liang, Xingyu Chen, Zhiwei He, Rui Wang, Zhaopeng Tu

TL;DR
This paper introduces SVIP, a dynamic length policy for speculative decoding that sets the draft length adaptively from the draft model's prediction entropy, speeding up long-form generation.
Contribution
The paper proposes a training-free, entropy-based dynamic length policy for speculative decoding, addressing limitations of fixed-length policies in complex, long-form generation scenarios.
Findings
Achieves up to 17% speedup on MT-Bench at 8K context.
Achieves up to 22% speedup for the QwQ model in long-form reasoning.
Demonstrates effectiveness on mainstream and reasoning-heavy benchmarks.
Abstract
Conventional speculative decoding (SD) methods utilize a predefined length policy for proposing drafts, which implies the premise that the target model smoothly accepts the proposed draft tokens. However, reality deviates from this assumption: the oracle draft length varies significantly, and the fixed-length policy hardly satisfies such a requirement. Moreover, such discrepancy is further exacerbated in scenarios involving complex reasoning and long-form generation, particularly under test-time scaling for reasoning-specialized models. Through both theoretical and empirical estimation, we establish that the discrepancy between the draft and target models can be approximated by the draft model's prediction entropy: a high entropy indicates a low acceptance rate of draft tokens, and vice versa. Based on this insight, we propose SVIP: Self-Verification Length Policy for Long-Context…
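The entropy signal described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the stopping rule (stop drafting as soon as the draft distribution's entropy exceeds a fixed threshold), the threshold value, the greedy token choice, and the names `draft_until_uncertain`, `next_token_probs`, and `toy_model` are all assumptions made for illustration, since the exact criterion is not given in the text above.

```python
import math
from typing import Callable, List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def draft_until_uncertain(
    next_token_probs: Callable[[List[int]], Sequence[float]],
    prefix: List[int],
    threshold: float = 1.0,   # assumed value, not taken from the paper
    max_draft_len: int = 8,   # assumed cap on draft length
) -> List[int]:
    """Propose draft tokens until the draft model becomes uncertain.

    High entropy is treated as a proxy for a low acceptance rate,
    so drafting stops and control would return to the target model.
    """
    draft: List[int] = []
    context = list(prefix)
    for _ in range(max_draft_len):
        probs = next_token_probs(context)
        if entropy(probs) > threshold:
            break  # draft model is unsure: hand over for verification
        # Greedy pick keeps the sketch deterministic (an assumption).
        token = max(range(len(probs)), key=probs.__getitem__)
        draft.append(token)
        context.append(token)
    return draft


def toy_model(context: List[int]) -> Sequence[float]:
    """Hypothetical draft model: confident for 3 steps, then uniform."""
    if len(context) < 3:
        return [0.97, 0.01, 0.01, 0.01]
    return [0.25, 0.25, 0.25, 0.25]


draft = draft_until_uncertain(toy_model, prefix=[])
```

With the toy model, drafting stops after three tokens, because the uniform distribution over four tokens has entropy ln 4 ≈ 1.39 nats, above the assumed threshold of 1.0. A fixed-length policy would instead keep proposing tokens that the target model is likely to reject.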
Taxonomy
Topics: Formal Methods in Verification
Methods: Guided Language to Image Diffusion for Generation and Editing · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
