Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

Yuhao Shen; Tianyu Liu; Xinyi Hu; Quan Kong; Baolin Zhang; Jun Dai; Jun Zhang; Shuang Ge; Lei Chen; Yue Li; Mingcheng Wan; and Cong Wang

arXiv:2605.20104·cs.LG·May 20, 2026

TL;DR

This paper introduces Graft, a speculative decoding framework that spends the compute budget freed by pruning draft trees on retrieval, accelerating large language model inference.

Contribution

Graft is a training-free, lossless framework for speculative decoding that attaches predictive tokens to pruned draft trees, raising both speed and candidate coverage.

Findings

01. Graft achieves up to 5.41× speedup on short-context benchmarks.

02. It improves average speedup over EAGLE-3 by up to 21.8%.

03. Graft establishes a new Pareto frontier in model inference efficiency.

Abstract

Speculative decoding (SD) accelerates large language model inference by leveraging a draft-then-verify paradigm. To maximize the acceptance rate, recent methods construct expansive draft trees, which unfortunately incur severe VRAM bandwidth and computational overheads that bottleneck end-to-end speedups. While dynamic-depth pruning can reduce this latency by removing marginal branches, it also discards potentially valid candidates, preventing the acceptance rate from reaching the upper bound of dense trees. In this paper, we identify a critical opportunity in resource allocation: the transition from dense to pruned drafting frees up significant computational budget. To break this Pareto tradeoff, we introduce Graft, a compensation framework that couples pruning and retrieval as mutually reinforcing operations. Pruning supplies sufficient budget for retrieval, while retrieval…
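
The budget reallocation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`prune_tree`, `retrieve_continuation`, `build_hybrid_tree`) are hypothetical, and the retrieval source used here (an n-gram lookup over the existing context) is an assumption, since the abstract does not say where retrieved tokens come from. The sketch only shows how pruning a draft tree to a few high-probability nodes leaves room, under a fixed node budget, for retrieved tokens; the verification step that makes speculative decoding lossless is omitted.

```python
from typing import Dict, List, Tuple

Path = Tuple[int, ...]  # token ids from the root to a node


def prune_tree(scores: Dict[Path, float], keep: int) -> List[Path]:
    """Keep the `keep` highest-scoring nodes of a draft tree.

    `scores` maps each node's path to its cumulative draft probability.
    A child's score never exceeds its parent's, so sorting by score
    (shorter paths first on ties) also keeps every kept node's ancestors.
    """
    ranked = sorted(scores, key=lambda p: (-scores[p], len(p)))
    return ranked[:keep]


def retrieve_continuation(context: List[int], n: int, max_len: int) -> List[int]:
    """Find the most recent earlier match of the last `n` context tokens
    and return up to `max_len` tokens that followed it (assumed retriever)."""
    if len(context) <= n:
        return []
    suffix = context[-n:]
    # Scan right to left, skipping the suffix's own position.
    for start in range(len(context) - n - 1, -1, -1):
        if context[start:start + n] == suffix:
            return context[start + n:start + n + max_len]
    return []


def build_hybrid_tree(
    scores: Dict[Path, float],
    context: List[int],
    budget: int,
    keep: int,
    n: int = 2,
) -> List[Path]:
    """Prune the draft tree, then fill the freed node budget with a
    retrieved chain attached at the root."""
    nodes = set(prune_tree(scores, keep))
    free = max(budget - len(nodes), 0)
    chain = retrieve_continuation(context, n, free)
    for i in range(1, len(chain) + 1):
        if len(nodes) >= budget:
            break
        # Prefixes already present in the tree cost no extra budget.
        nodes.add(tuple(chain[:i]))
    return sorted(nodes, key=lambda p: (len(p), p))


if __name__ == "__main__":
    draft_scores = {
        (5,): 0.6, (7,): 0.3, (5, 9): 0.4,
        (5, 2): 0.1, (7, 1): 0.05, (5, 9, 3): 0.2,
    }
    history = [1, 2, 3, 4, 5, 9, 8, 6, 1, 2, 3, 4]
    print(build_hybrid_tree(draft_scores, history, budget=6, keep=3))
```

In this toy setting the pruned tree and the retrieved chain share a prefix, so the retrieved tokens extend an existing branch rather than consuming the whole freed budget. A real system would pass the resulting tree to the target model for verification in a single forward pass.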
