FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

Yaojie Zhang; Jianuo Huang; Junlong Ke; Yuhang Han; Yongji Long; Tianchen Zhao; Biqing Qi; Linfeng Zhang

arXiv:2605.20022 · cs.CL · May 20, 2026

TL;DR

FlexDraft is a flexible, lossless speculative decoding framework that adapts to varying batch sizes. It uses attention tuning and bonus-guided calibration to speed up large language model inference without degrading output quality.

Contribution

It proposes three techniques (Attention Tuning, Bonus-guided Calibration, and Flex Decoding) that improve the efficiency and robustness of parallel speculative decoding without quality loss.

Findings

01. Achieves high throughput at large batch sizes without quality degradation.

02. Effectively mitigates draft verification mismatch caused by bonus token uncertainty.

03. Maintains high draft quality with minimal additional training or computational overhead.

Abstract

Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus…
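The draft-then-verify loop described at the start of the abstract can be illustrated with a minimal sketch of conventional (sequential) greedy speculative decoding. This is not FlexDraft's method. The names `speculative_step`, `generate`, `greedy_generate`, `toy_target`, and `toy_draft` are hypothetical, the toy "models" are deterministic next-token functions, and "bonus token" is used here in the usual sense of the extra token the target contributes once every draft is accepted, which is an assumption about how the paper uses the term.

```python
from typing import Callable, List

Token = int
# Hypothetical toy interface: a "model" maps a context to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextTokenFn,
                     target_next: NextTokenFn, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens to append to prefix."""
    # 1) The drafter proposes k candidate tokens autoregressively.
    drafts: List[Token] = []
    ctx = list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # 2) The target checks each candidate. A real system scores all k
    #    positions in one forward pass; this loop only mimics the outcome.
    accepted: List[Token] = []
    ctx = list(prefix)
    for tok in drafts:
        expected = target_next(ctx)
        if tok != expected:
            # First mismatch: keep the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3) Every draft was accepted, so the target's next prediction comes
    #    for free (the "bonus token" in the usual terminology).
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[Token], draft_next: NextTokenFn,
             target_next: NextTokenFn, k: int, max_new: int) -> List[Token]:
    """Repeat speculative steps until max_new tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


def greedy_generate(prefix: List[Token], target_next: NextTokenFn,
                    max_new: int) -> List[Token]:
    """Plain greedy decoding with the target only, used as the reference."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


def toy_target(ctx: List[Token]) -> Token:
    # Toy target model: counts upward modulo 10.
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    # Toy drafter: agrees with the target except right after token 4.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Under greedy decoding, the speculative output matches target-only decoding token for token, which is what "lossless" means in this setting: the drafter affects only how many tokens each round yields, never which tokens are emitted. Comparing `generate([0], toy_draft, toy_target, 3, 8)` with `greedy_generate([0], toy_target, 8)` checks this on the toy models.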
