Scaling Speculative Decoding with Lookahead Reasoning
Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, Hao Zhang

TL;DR
Lookahead Reasoning enhances speculative decoding by introducing a step-level parallelism layer, significantly increasing decoding speed while maintaining answer quality across various benchmarks.
Contribution
This paper introduces Lookahead Reasoning, a novel method that combines step-level and token-level parallelism to surpass existing speculative decoding speed limits.
Findings
Combined with speculative decoding, raises the speedup from 1.4x to 2.1x.
Maintains answer quality across multiple benchmarks.
Speedup scales better with additional GPU throughput.
Abstract
Reasoning models excel by generating long chains of thought, but decoding the resulting thousands of tokens is slow. Token-level speculative decoding (SD) helps, but its benefit is capped, because the chance that an entire γ-token guess is correct falls exponentially as γ grows. Allocating more compute to longer token drafts therefore hits an algorithmic ceiling, making the speedup modest and hardware-agnostic. We raise this ceiling with Lookahead Reasoning, which exploits a second, step-level layer of parallelism. Our key insight is that reasoning models generate step by step, and each step needs only to be semantically correct, not an exact token match. In Lookahead Reasoning, a lightweight draft model proposes several future steps; the target model expands each proposal in one batched pass, and a verifier keeps semantically correct steps while letting the target…
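The ceiling on token-level speculative decoding can be illustrated with a small calculation. This is a minimal sketch, not taken from the paper: it assumes each drafted token is accepted independently with a fixed probability alpha (a common simplifying model in the speculative decoding literature), and the function names are hypothetical. Under that assumption, the chance that a whole γ-token draft is accepted is alpha^γ, and the expected number of tokens produced per target-model pass is bounded by 1 / (1 - alpha) no matter how long the draft is.

```python
def full_accept_prob(alpha: float, gamma: int) -> float:
    """Probability that all `gamma` drafted tokens are accepted.

    Assumption (not from the paper): each token is accepted
    independently with probability `alpha`.
    """
    return alpha ** gamma


def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass.

    Counts the accepted draft tokens plus the one token the target
    model always contributes: sum of alpha**i for i in 0..gamma.
    """
    if alpha == 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    alpha = 0.8  # hypothetical per-token acceptance rate
    ceiling = 1.0 / (1.0 - alpha)
    for gamma in (1, 2, 4, 8, 16, 32):
        p = full_accept_prob(alpha, gamma)
        e = expected_tokens_per_pass(alpha, gamma)
        print(f"gamma={gamma:2d}  P(all accepted)={p:.4f}  "
              f"E[tokens/pass]={e:.3f}  ceiling={ceiling:.1f}")
```

Under this model, lengthening the draft yields diminishing returns: the expected tokens per pass approach the ceiling while the probability of a fully accepted draft shrinks toward zero. That is the motivation for adding a second, step-level axis of parallelism rather than drafting more tokens.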
Taxonomy
Topics: Logic, Reasoning, and Knowledge · Advanced Algebra and Logic · Semantic Web and Ontologies
