Think Before You Accept: Semantic Reflective Verification for Faster Speculative Decoding
Yixuan Wang, Yijun Liu, Shiyu Ji, Yuzhuang Xu, Yang Xu, Qingfu Zhu, Wanxiang Che

TL;DR
This paper introduces Reflective Verification, a semantics-aware, training-free method leveraging LLMs' reflective capacity to improve speculative decoding speed by verifying draft tokens at a semantic level, enhancing efficiency without sacrificing accuracy.
Contribution
It proposes a novel, semantics-aware verification technique that uses prompt-based probing to assess draft tokens, outperforming distributional methods and complementing existing statistical verification approaches.
Findings
Significantly increases acceptance length of draft tokens.
Combines with statistical verification for an additional 5–15% decoding speedup.
Maintains model performance while accelerating inference.
Abstract
Large language models (LLMs) suffer from high inference latency due to the auto-regressive decoding process. Speculative decoding accelerates inference by generating multiple draft tokens using a lightweight model and verifying them in parallel. However, existing verification methods rely heavily on distributional consistency while overlooking semantic correctness, thereby limiting the potential speedup of speculative decoding. While some methods employ additional models for relaxed verification of draft tokens, they often fail to generalize effectively to more diverse or open-domain settings. In this work, we propose Reflective Verification, a training-free and semantics-aware approach that achieves a better trade-off between correctness and efficiency. Specifically, we leverage the inherent reflective capacity of LLMs to semantically assess the correctness of draft tokens in parallel…
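For context, the distribution-based verification that the abstract contrasts against can be sketched as follows. This is a minimal illustration of the standard speculative-sampling accept/reject rule, not code from the paper; the function name `verify_draft` and the dict-based probability tables are assumptions made for the example, and the bonus token sampled when every draft token is accepted is omitted for brevity.

```python
import random


def verify_draft(draft_tokens, q_probs, p_probs, rng=random):
    """Standard speculative-sampling verification (illustrative sketch).

    draft_tokens: token ids proposed by the draft model.
    q_probs[i]:   dict token -> draft-model probability at position i.
    p_probs[i]:   dict token -> target-model probability at position i.
    Returns the accepted prefix, plus one corrected token if a draft
    token is rejected.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_probs[i], q_probs[i]
        # Accept the draft token with probability min(1, p(tok) / q(tok)).
        accept_prob = min(1.0, p.get(tok, 0.0) / q[tok])
        if rng.random() < accept_prob:
            accepted.append(tok)
            continue
        # On rejection, resample from the normalized residual max(0, p - q)
        # and stop; later draft tokens are discarded.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        total = sum(residual.values())
        tokens = list(residual)
        weights = [residual[t] / total for t in tokens]
        accepted.append(rng.choices(tokens, weights=weights)[0])
        break
    return accepted
```

Because acceptance depends only on the ratio of target to draft probabilities, a draft token that is reasonable in meaning but unlikely under the target distribution is still rejected, which is the limitation the abstract points to.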
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- Introduces a novel, training-free method that leverages LLMs’ self-reflective capabilities to assess semantic correctness, going beyond traditional distributional checks.
- Can be seamlessly integrated with existing speculative decoding and verification frameworks, demonstrating strong generalization across models and domains.
- Significantly increases draft token acceptance length and achieves an additional 5–15% decoding speedup without sacrificing model performance.
- The paper does not clearly define core terms like “semantic correctness,” making the approach harder to interpret and evaluate.
- The rationale behind fusing semantic and consistency-based token distributions lacks detailed explanation, leaving concerns about potential interference and effectiveness.
- The role and selection of key parameters (e.g., alpha) are not well explained, limiting the method’s reproducibility and adaptability across different models and tasks.
- Obtaining the original and semantic logits in parallel in a single forward pass incurs little additional overhead compared to standard speculative decoding.
- The approach of fusing logits to modify the target distribution is (to my knowledge) novel, and seems to me to be an intriguing direction of further research.
- Strong empirical results demonstrate larger mean accepted tokens with minimal quality degradation.
- Important hyper-parameters are ablated and justified.
- Experiments are only conducted on the Llama 3 model family. Evaluations with other model families would strengthen the claims of the paper.
- I have minor concerns about the underlying mechanisms of the method. The prompt used to obtain the semantic logits includes the verified prefix, the newly drafted tokens, a reflective prompt, a position prefix (which is a suffix of the verified prefix), and the newly drafted tokens again. By the nature of LLMs, it seems the target model is heavily incent
The paper is well-written. The proposed method is conceptually simple and easy to understand.
- The original speculative decoding algorithms (https://arxiv.org/abs/2211.17192, https://arxiv.org/abs/2302.01318) are mathematically guaranteed to maintain the same output distribution as the target LLM. The proposed method, while empirically shown to be useful, is technically no longer a speculative algorithm and lacks a theoretical explanation.
- The method introduces multiple hyperparameters - the mixing weight $\alpha$, and the reflective prompt. These raise complexities in real-world dep
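Several of the reviews above turn on how a mixing weight $\alpha$ changes the target distribution used for verification. The sketch below illustrates the idea under a loud assumption: this page does not give the paper's fusion rule, so a plain linear interpolation of logits is swapped in here purely for illustration. All names (`fuse_logits`, `acceptance_prob`) and the toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(original_logits, semantic_logits, alpha):
    """ASSUMED fusion rule: linear interpolation with mixing weight alpha.

    alpha = 0 recovers the original target logits; larger alpha gives
    more weight to the logits obtained with the reflective prompt.
    """
    return [(1 - alpha) * o + alpha * s
            for o, s in zip(original_logits, semantic_logits)]


def acceptance_prob(token, draft_probs, target_probs):
    """Standard speculative acceptance probability min(1, p / q)."""
    return min(1.0, target_probs[token] / draft_probs[token])


# Toy two-token vocabulary; the draft model proposed token 1.
original = [2.0, 0.0]     # original target logits prefer token 0
semantic = [0.0, 2.0]     # hypothetical reflective logits prefer token 1
draft_probs = [0.2, 0.8]

strict = acceptance_prob(1, draft_probs, softmax(fuse_logits(original, semantic, 0.0)))
relaxed = acceptance_prob(1, draft_probs, softmax(fuse_logits(original, semantic, 0.5)))
print(f"alpha=0.0 -> {strict:.3f}, alpha=0.5 -> {relaxed:.3f}")
```

In this toy case the acceptance probability rises once the semantic logits are mixed in, which mirrors the longer acceptance lengths the reviewers report. It also shows why the distribution-preservation guarantee no longer applies: for any alpha above zero, tokens are verified against a distribution that differs from the target model's own.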
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
