Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
Yuhao Shen, Tianyu Liu, Xinyi Hu, Quan Kong, Baolin Zhang, Jun Dai, Jun Zhang, Shuang Ge, Lei Chen, Yue Li, Mingcheng Wan, and Cong Wang

TL;DR
This paper introduces Graft, a speculative decoding framework that prunes the draft tree and spends the freed compute budget on retrieved candidates, accelerating large language model inference.
Contribution
Graft is a training-free, lossless framework that improves speculative decoding by attaching predictive tokens to pruned draft trees, improving both speed and candidate coverage.
Findings
Graft achieves up to 5.41× speedup on short-context benchmarks.
It improves average speedup over EAGLE-3 by up to 21.8%.
Graft establishes a new Pareto frontier in model inference efficiency.
Abstract
Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval…
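The budget argument in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the `Node` layout, the score threshold used for pruning, the suffix-match retrieval, and the names `prune`, `retrieve`, and `graft` are all assumptions made for illustration. It only shows the accounting idea that nodes removed by pruning free slots in a fixed verification budget, and that those slots can be refilled with tokens retrieved from the existing context.

```python
from dataclasses import dataclass


@dataclass
class Node:
    token: int
    parent: int    # index of the parent node, -1 if attached to the last verified token
    score: float   # cumulative draft probability along the path (hypothetical field)


def prune(nodes, threshold):
    """Drop nodes scoring below threshold, plus any descendants of dropped nodes.

    Assumes nodes are listed parent-before-child.
    """
    kept, remap = [], {-1: -1}
    for i, n in enumerate(nodes):
        if n.parent in remap and n.score >= threshold:
            remap[i] = len(kept)
            kept.append(Node(n.token, remap[n.parent], n.score))
    return kept


def retrieve(context, n, max_tokens):
    """Return the tokens that followed the most recent earlier match of the last n tokens."""
    if max_tokens <= 0 or len(context) <= n:
        return []
    key = context[-n:]
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == key:
            return context[start + n:start + n + max_tokens]
    return []


def graft(nodes, context, budget, threshold, n=2):
    """Prune the draft tree, then fill the freed budget with one retrieved chain."""
    kept = prune(nodes, threshold)
    freed = budget - len(kept)            # slots released by pruning
    parent = -1
    for tok in retrieve(context, n, freed):
        kept.append(Node(tok, parent, 0.0))   # retrieved tokens carry no draft score
        parent = len(kept) - 1
    return kept


draft = [Node(5, -1, 0.9), Node(6, 0, 0.5), Node(7, 0, 0.05),
         Node(8, 2, 0.04), Node(9, -1, 0.02)]
tree = graft(draft, context=[1, 2, 3, 4, 1, 2], budget=4, threshold=0.1)
print([(n.token, n.parent) for n in tree])
```

In this toy run, pruning keeps two of the five drafted nodes, and the two freed slots are filled with a retrieved chain attached at the root. The target model's verification pass, which is what would keep the output lossless, is not modeled here, and the sketch makes no attempt to deduplicate retrieved tokens against drafted ones.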
