SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting
Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

TL;DR
SpecBlock is a block-iterative drafter for speculative decoding. Each drafter forward produces a block of K dependent positions, so draft paths keep their dependence while drafting stays cheap, which speeds up LLM inference.
Contribution
The paper proposes a block-iterative drafter with mechanisms for carrying path dependence, growing the draft tree dynamically, and adapting to drafting cost, yielding a speedup over EAGLE-3 at a lower drafting cost.
Findings
Achieves 8-13% mean speedup over EAGLE-3 at 44-52% of its drafting cost.
Cost-aware adaptation extends speedup to 11-19%.
Improves inference efficiency by combining path dependence with dynamic tree drafting.
Abstract
Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions…
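The drafter-call trade-off described in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not the authors' implementation: the function names are hypothetical, the block model assumes exactly one drafter forward per block level, and verification is simplified to greedy token matching along a single draft path.

```python
from math import ceil
from typing import Sequence


def drafter_calls_autoregressive(depth: int) -> int:
    """One drafter forward per tree depth (the per-depth pattern
    the abstract attributes to autoregressive drafters)."""
    return depth


def drafter_calls_block(depth: int, k: int) -> int:
    """One drafter forward per block of k dependent positions.
    Assumption: each block level costs exactly one forward."""
    return ceil(depth / k)


def accepted_prefix_length(draft: Sequence[int],
                           target_greedy: Sequence[int]) -> int:
    """Simplified greedy verification of one draft path: count the
    leading draft tokens that match the target model's choices."""
    accepted = 0
    for draft_token, target_token in zip(draft, target_greedy):
        if draft_token != target_token:
            break  # the first mismatch rejects the rest of the path
        accepted += 1
    return accepted


if __name__ == "__main__":
    depth, k = 8, 4  # hypothetical values, chosen only for illustration
    print(drafter_calls_autoregressive(depth), drafter_calls_block(depth, k))
    print(accepted_prefix_length([5, 9, 2, 7], [5, 9, 3, 7]))
```

In this toy setting, a block size of 4 reaches depth 8 in two drafter forwards instead of eight. The saving only pays off if the later positions in each block still match what the verifier accepts, which is why the abstract stresses carrying path dependence within and across blocks.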
