ConFu: Contemplate the Future for Better Speculative Sampling
Zongyue Qin, Raghavv Goel, Mukul Gagrani, Risheek Garrepalli, Mingu Lee, Yizhou Sun

TL;DR
ConFu introduces a future-aware speculative decoding framework that enhances draft model predictions by anticipating future tokens, significantly improving inference speed for large language models.
Contribution
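ConFu proposes a future-oriented speculative decoding method in which contemplate tokens and soft prompts let the draft model use future-oriented signals from the target model, improving token acceptance and generation speed over existing approaches such as EAGLE-3.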
Findings
ConFu improves token acceptance rates and generation speed over EAGLE-3 by 8-11% on Llama-3 models.
ConFu achieves approximately 20% speedup on Qwen-3 models across downstream tasks.
The framework effectively leverages future signals, reducing error accumulation in draft models.
Abstract
Speculative decoding has emerged as a powerful approach to accelerate large language model (LLM) inference by employing lightweight draft models to propose candidate tokens that are subsequently verified by the target model. The effectiveness of this paradigm critically depends on the quality of the draft model. While recent advances such as the EAGLE series achieve state-of-the-art speedup, existing draft models remain limited by error accumulation: they condition only on the current prefix, causing their predictions to drift from the target model over steps. In this work, we propose ConFu (Contemplate the Future), a novel speculative decoding framework that enables draft models to anticipate the future direction of generation. ConFu introduces (i) contemplate tokens and soft prompts that allow the draft model to leverage future-oriented signals from the target model at…
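
As a rough illustration of the draft-then-verify loop the abstract describes, the sketch below implements plain greedy speculative decoding with toy next-token functions. It is not ConFu's method: the contemplate tokens and soft prompts are not modeled, and every name here (`speculative_step`, `generate`, `toy_target`, `toy_draft`) is hypothetical. The toy draft model is deliberately wrong on some prefixes, so it shows how one bad draft token ends the accepted run for that round.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> Tuple[List[Token], int]:
    """Run one draft-then-verify round with greedy acceptance.

    Returns the tokens appended this round and how many draft tokens were accepted.
    """
    # Draft phase: the cheap model proposes k tokens autoregressively.
    drafted: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # Verify phase: a real system scores all k positions in one target forward
    # pass; here the target is called per position to keep the logic visible.
    appended: List[Token] = []
    ctx = list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            appended.append(expected)
            return appended, len(appended) - 1
        appended.append(tok)
        ctx.append(tok)

    # Every draft token matched, so the target contributes one bonus token.
    appended.append(target_next(ctx))
    return appended, k


def generate(prefix: List[Token], draft_next: NextTokenFn, target_next: NextTokenFn,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return the number of verify rounds used."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        new_tokens, _ = speculative_step(out, draft_next, target_next, k)
        out.extend(new_tokens)
        rounds += 1
    return out[: len(prefix) + max_new], rounds


def toy_target(ctx: List[Token]) -> Token:
    # Assumed stand-in for the target model: count upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Assumed stand-in for the draft model: agrees with the target except
    # when the last token is 3 or 7, where it guesses 0.
    return 0 if ctx[-1] % 4 == 3 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    tokens, rounds = generate([0], toy_draft, toy_target, k=4, max_new=12)
    print(tokens, rounds)
```

Because a draft token is kept only when it equals the target's greedy choice, the output is identical to decoding with the target alone; the gain comes from needing fewer verification rounds than generated tokens. A draft model that tracks the target more closely gets longer accepted runs per round, which is the quantity the abstract says ConFu aims to improve.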
