HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding

Siran Liu; Yang Ye; Qianchao Zhu; Zane Cao; Yongchao He

arXiv:2505.13254 · cs.CL · October 27, 2025

TL;DR

HeteroSpec is a heterogeneity-aware speculative decoding method that allocates verification effort adaptively according to candidate uncertainty, speeding up large language model decoding without retraining.

Contribution

The paper proposes a heterogeneity-adaptive framework that estimates verification complexity and dynamically tunes decoding parameters, improving efficiency over existing methods.

Findings

1. Achieves 4.24× average decoding speedup over state-of-the-art methods.
2. Maintains exact output distributions while improving efficiency.
3. Requires no model retraining and is compatible with other inference optimizations.
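
The second finding rests on the exactness guarantee of speculative decoding in general. As a minimal sketch, assuming the standard speculative-sampling acceptance rule for a single drafted token (this is not HeteroSpec's own verification code, which this page does not show), the check could look like the following. The names `verify_token`, `p_target`, and `q_draft` are illustrative assumptions.

```python
import random


def verify_token(p_target, q_draft, token, rng):
    """Standard speculative-sampling check for one drafted token.

    p_target, q_draft: probability lists over the same vocabulary
                       (target model and draft model, respectively).
    token: index that the draft model sampled from q_draft.
    Returns (emitted_token, accepted_flag).
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual)[0], False


# Illustrative distributions (made up for this sketch, not from the paper).
rng = random.Random(0)
p = [0.7, 0.2, 0.1]
q = [0.3, 0.5, 0.2]
drafted = rng.choices(range(3), weights=q)[0]
emitted, accepted = verify_token(p, q, drafted, rng)
```

Under this rule the emitted token follows the target distribution regardless of how good the draft is; a poor draft only lowers the acceptance rate, which is why draft quality affects speed and not output.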

Abstract

Autoregressive decoding inherently limits the inference throughput of Large Language Models (LLMs) due to its sequential dependency. Speculative decoding mitigates this by verifying multiple predicted tokens in parallel, but its efficiency remains constrained by what we identify as verification heterogeneity -- the uneven difficulty of verifying different speculative candidates. In practice, a small subset of high-confidence predictions accounts for most successful verifications, yet existing methods treat all candidates uniformly, leading to redundant computation. We present HeteroSpec, a heterogeneity-adaptive speculative decoding framework that allocates verification effort in proportion to candidate uncertainty. HeteroSpec estimates verification complexity using a lightweight entropy-based quantifier, partitions candidates via a data-driven stratification policy, and dynamically tunes…
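
The abstract names an entropy-based quantifier and a data-driven stratification policy but does not spell them out here. The sketch below shows one way such a pipeline could be wired together, assuming Shannon entropy of the draft distribution as the uncertainty score and quantile cut points as the stratification. The function names, the quantile choice, and the example distributions are all hypothetical and are not taken from the paper.

```python
import math
import statistics


def shannon_entropy(probs):
    """Entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fit_thresholds(calibration_entropies, n_strata=3):
    """Hypothetical data-driven cut points: quantiles of observed entropies.

    Returns n_strata - 1 thresholds in ascending order.
    """
    return statistics.quantiles(calibration_entropies, n=n_strata)


def stratify(entropy, thresholds):
    """Stratum index = number of thresholds the entropy exceeds."""
    return sum(entropy > t for t in thresholds)


# Made-up draft distributions, from near-certain to near-uniform.
draft_dists = [[0.97, 0.02, 0.01], [0.5, 0.3, 0.2], [0.34, 0.33, 0.33]]
entropies = [shannon_entropy(d) for d in draft_dists]
thresholds = fit_thresholds(entropies, n_strata=3)
strata = [stratify(h, thresholds) for h in entropies]
```

Each stratum index could then select its own decoding parameters. Which parameters are tuned, and in which direction, is left open here because the abstract is truncated at that point.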

Taxonomy

Methods: Pruning