Calibrated Speculative Decoding: Frequency-Guided Candidate Selection for Efficient Inference

Xuwen Zhou; Fangxin Liu; Chao Wang; Xiao Zheng; Hao Zheng; Min He; Li Jiang; Haibing Guan

arXiv:2604.13634 · cs.CL · April 16, 2026

TL;DR

Calibrated Speculative Decoding (CSD) is a training-free framework that speeds up large language model inference while maintaining accuracy. It recovers valid draft tokens that standard verification would discard, using frequency-guided candidate selection and probability-ratio checks.

Contribution

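CSD is a training-free approach built on two lightweight modules, Online Correction Memory and Semantic Consistency Gating, that recover valid tokens rejected by standard verification. The authors report that it outperforms existing methods in speed and accuracy across diverse models.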

Findings

01. Achieves up to 2.33x throughput speedup.

02. Maintains model accuracy across various tasks.

03. Enhances performance on complex reasoning datasets.

Abstract

Speculative decoding accelerates autoregressive generation by letting draft tokens bypass full verification, but conventional frameworks suffer from frequent false rejections, particularly when draft models produce semantically correct but lexically divergent outputs. In this paper, we present Calibrated Speculative Decoding (CSD), a training-free framework that recovers valid tokens discarded by standard verification. Guided by the principle of "Frequency-Guided Candidate Selection and Probability-Guarded Acceptance," CSD incorporates two lightweight modules: Online Correction Memory, which aggregates historical rejections to propose recurring divergence patterns as rescue candidates, and Semantic Consistency Gating, which verifies candidate admissibility using probability ratios instead of exact token matching. Our evaluation across diverse large language models demonstrates that CSD…
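
To make the two modules concrete, here is a minimal Python sketch of how a frequency-guided memory and a probability-ratio gate could interact during verification. It is an illustration built only from the abstract's description, not the authors' implementation: the names `CorrectionMemory`, `gate_accepts`, and `verify_token`, the greedy (argmax) verification rule, and the parameters `min_count` and `ratio_threshold` are all assumptions introduced for this example.

```python
from collections import Counter


class CorrectionMemory:
    """Hypothetical stand-in for Online Correction Memory.

    Counts how often a draft token was rejected in favor of a given
    target token, so recurring divergences can be proposed as rescue
    candidates.
    """

    def __init__(self, min_count=2):
        self.min_count = min_count    # assumed frequency cutoff
        self.pair_counts = Counter()  # (draft_token, target_token) -> count

    def record_rejection(self, draft_token, target_token):
        self.pair_counts[(draft_token, target_token)] += 1

    def is_candidate(self, draft_token, target_token):
        return self.pair_counts[(draft_token, target_token)] >= self.min_count


def gate_accepts(target_probs, draft_token, target_token, ratio_threshold=0.5):
    """Hypothetical stand-in for Semantic Consistency Gating.

    Accepts the draft token when its target-model probability is at
    least `ratio_threshold` times that of the target's own top token.
    """
    p_top = target_probs.get(target_token, 0.0)
    if p_top <= 0.0:
        return False
    return target_probs.get(draft_token, 0.0) / p_top >= ratio_threshold


def verify_token(memory, target_probs, draft_token, ratio_threshold=0.5):
    """Return True if the draft token is accepted at this position."""
    # Assumed baseline: greedy verification against the target's argmax.
    target_token = max(target_probs, key=target_probs.get)
    if draft_token == target_token:
        return True
    # Rescue path: the mismatch must be a recurring one AND pass the gate.
    if memory.is_candidate(draft_token, target_token) and gate_accepts(
        target_probs, draft_token, target_token, ratio_threshold
    ):
        return True
    # Otherwise reject and remember the divergence for later steps.
    memory.record_rejection(draft_token, target_token)
    return False
```

In this sketch a mismatched draft token is rejected the first few times it appears, and is rescued only once the same divergence has recurred `min_count` times and the target model still assigns it a comparable probability. A frequent but low-probability token stays rejected, which is the role the abstract assigns to the probability guard.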
