Efficient Adaptive Rejection Sampling for Accelerating Speculative Decoding in Large Language Models

Chendong Sun; Ali Mao; Lei Xu; Mingmin Chen

arXiv:2512.13194·cs.CL·December 18, 2025

TL;DR

This paper presents EARS, an adaptive rejection sampling method that improves speculative decoding efficiency in large language models by dynamically adjusting acceptance thresholds based on model uncertainty, leading to significant throughput gains.

Contribution

EARS introduces a novel adaptive threshold mechanism for rejection sampling in speculative decoding, reducing random rejections without altering model architectures.

Findings

01. Achieves up to an 18.12% increase in inference throughput.

02. Maintains accuracy, with only a 0.84% drop on GSM8K.

03. Integrates into existing speculative decoding frameworks without altering model architectures.

Abstract

Speculative Decoding is a prominent technique for accelerating the autoregressive inference of large language models (LLMs) by employing a fast draft model to propose candidate token sequences and a large target model to verify them in parallel. However, its core component -- the rejection sampling mechanism -- relies on a fixed, context-independent random threshold. This leads to a significant "random rejection" problem in high-uncertainty generation scenarios, where plausible candidate tokens are frequently rejected due to random chance, undermining inference efficiency. This paper introduces Efficient Adaptive Rejection Sampling (EARS), a novel method that dynamically adjusts the acceptance threshold by incorporating the target model's own predictive uncertainty, measured as 1 - max(P_target). By introducing a tolerance term proportional to this uncertainty, EARS intelligently…
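
The abstract is cut off before it states the exact acceptance rule, so the following is only a minimal sketch of the idea it describes, not the authors' implementation. It compares the standard speculative decoding test (accept a drafted token when a uniform draw u is at most min(1, P_target / P_draft)) with a relaxed test in which a tolerance proportional to the uncertainty 1 - max(P_target) is subtracted from u. The function names, the scaling factor `beta`, and the way the tolerance enters the comparison are all assumptions made for illustration.

```python
def acceptance_ratio(target_probs, draft_probs, token):
    """min(1, P_target(token) / P_draft(token)) for a drafted token.

    The token was sampled from the draft model, so draft_probs[token] > 0.
    """
    return min(1.0, target_probs[token] / draft_probs[token])


def standard_accept(target_probs, draft_probs, token, u):
    """Standard rejection test with a uniform draw u in [0, 1)."""
    return u <= acceptance_ratio(target_probs, draft_probs, token)


def adaptive_accept(target_probs, draft_probs, token, u, beta=0.1):
    """Relaxed test (ASSUMED form, not taken from the paper).

    uncertainty = 1 - max(P_target); tolerance = beta * uncertainty.
    beta is a hypothetical hyperparameter. The tolerance lowers the
    effective threshold, so near-miss tokens pass when the target
    model itself is uncertain.
    """
    uncertainty = 1.0 - max(target_probs)
    tolerance = beta * uncertainty
    return acceptance_ratio(target_probs, draft_probs, token) >= u - tolerance


# High-uncertainty step: the target distribution is flat (max prob 0.4).
flat_target = [0.40, 0.35, 0.25]
flat_draft = [0.30, 0.50, 0.20]
near_miss_standard = standard_accept(flat_target, flat_draft, token=1, u=0.72)
near_miss_adaptive = adaptive_accept(flat_target, flat_draft, token=1, u=0.72)

# Low-uncertainty step: the target is confident, so the tolerance is tiny.
sharp_target = [0.98, 0.01, 0.01]
sharp_draft = [0.50, 0.40, 0.10]
confident_adaptive = adaptive_accept(sharp_target, sharp_draft, token=1, u=0.03)
```

In this sketch the near-miss token on the flat distribution is rejected by the standard test but accepted by the relaxed one, while the confident case behaves almost exactly like the standard test. Any relaxation of this kind gives up the exact distribution-matching guarantee of standard rejection sampling, which is consistent with the small accuracy drop reported in the findings.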

Taxonomy

Topics: Topic Modeling · Speech Recognition and Synthesis · Natural Language Processing Techniques