Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Tiancheng Su; Meicong Zhang; Guoxiu He

arXiv:2512.23765·cs.CL·January 1, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces Entropy-Aware Speculative Decoding (EASD), a training-free method that improves LLM reasoning by dynamically penalizing low-confidence predictions. It can surpass the target model's own performance while keeping efficiency comparable to standard speculative decoding (SD).

Contribution

EASD enhances speculative decoding with an entropy-based penalty, allowing the draft-target pair to outperform the target LLM on its own without additional training.

Findings

01

EASD outperforms existing SD methods on reasoning benchmarks.

02

EASD often surpasses the performance of the target LLM.

03

EASD maintains efficiency comparable to standard SD.

Abstract

Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD…
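The rejection rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold values (`entropy_threshold`, `top_n`, `min_overlap`), and the choice to measure overlap as the fraction of shared top-N token ids are all assumptions made for illustration, since the abstract does not specify them. The check would sit on top of the usual SD acceptance test, which is omitted here.

```python
import math
from typing import Sequence, Set


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def top_n_ids(probs: Sequence[float], n: int) -> Set[int]:
    """Token ids of the n most probable entries (ties keep index order)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(order[:n])


def entropy_penalty_rejects(
    draft_probs: Sequence[float],
    target_probs: Sequence[float],
    *,
    entropy_threshold: float = 1.0,  # assumed value, not from the paper
    top_n: int = 3,                  # assumed value, not from the paper
    min_overlap: float = 0.5,        # assumed value, not from the paper
) -> bool:
    """Return True if the drafted token should be rejected and re-sampled
    by the target model under the (hypothetical) penalty rule: both
    models are uncertain and their top-N candidates largely coincide."""
    both_uncertain = (
        entropy(draft_probs) > entropy_threshold
        and entropy(target_probs) > entropy_threshold
    )
    shared = top_n_ids(draft_probs, top_n) & top_n_ids(target_probs, top_n)
    overlap = len(shared) / top_n
    return both_uncertain and overlap >= min_overlap


# Usage: two flat distributions over a 4-token vocabulary trigger the penalty,
# while a confident draft distribution does not.
uncertain = [0.25, 0.25, 0.25, 0.25]
confident = [0.97, 0.01, 0.01, 0.01]
print(entropy_penalty_rejects(uncertain, uncertain))
print(entropy_penalty_rejects(confident, uncertain))
```

In this sketch the penalty fires only when both conditions hold, so a confident draft token is left to the standard SD acceptance test.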

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is well-written, with a clear motivation and contribution.
2. The proposed method is simple to apply.
3. The experiments are thorough, with clear results to show the performance and efficiency benefit from EASD.

Weaknesses

1. The proposed method lacks theoretical support. Both baselines, SD and RSD, are well supported by theoretical justification.
2. The efficiency comparison seems unfair. Only the number of generated tokens is shown. It is suggested to show how many tokens are generated by the draft and target models separately. Compared to SD, EASD has two more conditions, so I believe more tokens should be rejected and regenerated by the target model.

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The paper offers a fresh and elegant extension of speculative decoding by incorporating entropy as a dynamic control signal. Its training-free formulation and focus on uncertainty-driven collaboration between large and small models distinguish it from prior reward-based or alignment-based methods.

Weaknesses

The paper attributes Reward Guided Speculative Decoding (RSD) to Li et al., 2025a, which actually refers to Reward Shifted Speculative Sampling (SSS). The correct citation should be Liao et al., 2025 (arXiv:2501.19324). This misattribution may mislead readers about the baseline implementation and the conceptual lineage of RSD. The authors should revise all mentions, tables, and references accordingly and clarify whether their RSD baseline follows Liao et al.’s procedure or the SSS variant. Whil

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The idea of EASD is simple and effective: it is a conceptually clear, training-free extension to speculative decoding that leverages entropy and distributional overlap to improve reasoning quality, showing improved performance without substantially increasing complexity or computational cost.
- EASD consistently outperforms both standard and reward-guided speculative decoding (RSD) across diverse reasoning benchmarks.
- The paper is easy to read and understand, with analyses that help interpret the

Weaknesses

1. For methodology design, while leveraging entropy and top-n is a signal, the choice of the threshold is pretty heuristic. For entropy, the threshold has to be pre-defined through computation on a validation set, while for top-n, it is pre-defined by a heuristic. This reliance on fixed thresholds raises concerns about the method’s robustness and generalizability across different model pairs or domains. Since entropy distributions and token overlap patterns can vary significantly between archite

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods