SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
Yijun Lin, Jinhao Sheng, Qingyue Cai, Feng Zhou

TL;DR
SpecTr-GBV is a unified speculative decoding method that combines multi-draft generation with block verification, speeding up language model inference while preserving output quality.
Contribution
SpecTr-GBV unifies multi-draft and greedy block verification in a single optimal transport framework, improving both theoretical efficiency and empirical decoding performance.
Findings
Achieves superior speedup over baselines.
Delivers significantly higher block efficiency while preserving output quality.
Theoretically proven to reach optimal acceptance length with more drafts.
Abstract
Autoregressive language models suffer from high inference latency due to their sequential decoding nature. Speculative decoding (SD) mitigates this by employing a lightweight draft model to propose candidate tokens, which are selectively verified by a larger target model. While existing methods either adopt multi-draft strategies to increase acceptance rates or block verification techniques to jointly verify multiple tokens, they remain limited by treating these improvements in isolation. In this work, we propose SpecTr-GBV, a novel SD method that unifies multi-draft and greedy block verification (GBV) into a single framework. By formulating the verification step as an optimal transport problem over draft and target token blocks, SpecTr-GBV improves both theoretical efficiency and empirical performance. We theoretically prove that SpecTr-GBV achieves the optimal expected acceptance…
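For orientation, the draft-then-verify loop described in the abstract can be sketched with the standard single-draft, token-by-token speculative sampling rule. This is the baseline that multi-draft and block verification methods build on, not SpecTr-GBV's own optimal transport verification, which the abstract does not spell out. The function names (`verify_draft`, `sample`) and the dictionary representation of the distributions are hypothetical and chosen only for illustration.

```python
import random


def sample(dist, rng):
    """Draw one token from a dict mapping token -> non-negative weight."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def verify_draft(draft_tokens, q_probs, p_probs, rng=random):
    """Baseline token-level verification (illustrative, not SpecTr-GBV).

    draft_tokens: tokens proposed by the draft model.
    q_probs[i]:   draft-model distribution at position i (token -> prob).
    p_probs[i]:   target-model distribution at position i; it has one
                  more entry than draft_tokens, for the bonus token.
    Returns the accepted prefix plus exactly one extra token.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = p_probs[i].get(tok, 0.0)
        q = q_probs[i][tok]  # > 0, since tok was sampled from q
        # Accept the drafted token with probability min(1, p / q).
        if rng.random() < min(1.0, p / q):
            out.append(tok)
            continue
        # On rejection, resample from the normalized residual max(p - q, 0)
        # and stop; later draft tokens are discarded.
        residual = {
            t: max(p_probs[i].get(t, 0.0) - q_probs[i].get(t, 0.0), 0.0)
            for t in p_probs[i]
        }
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(p_probs[len(draft_tokens)], rng))
    return out
```

Each call produces between one and `len(draft_tokens) + 1` tokens for a single target-model pass. Because this baseline stops at the first rejected token, the number of tokens accepted per pass is the quantity that multi-draft proposals and joint block verification aim to increase.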
