HeteroSpec: Leveraging Contextual Heterogeneity for Efficient Speculative Decoding
Siran Liu, Yang Ye, Qianchao Zhu, Zane Cao, Yongchao He

TL;DR
HeteroSpec is a heterogeneity-aware speculative decoding method that allocates verification effort according to candidate uncertainty, speeding up large language model decoding without retraining.
Contribution
The paper proposes a heterogeneity-adaptive framework that estimates the verification complexity of each speculative candidate and dynamically tunes decoding parameters to match, reducing redundant computation relative to methods that treat all candidates uniformly.
Findings
Achieves 4.24× average decoding speedup over state-of-the-art methods.
Maintains exact output distributions while improving efficiency.
Requires no model retraining and is compatible with other inference optimizations.
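The summary does not say which verification rule HeteroSpec uses. As background for the "exact output distributions" finding, the sketch below shows the standard single-token rejection rule from the speculative sampling literature, which is one well-known way a verifier can keep the target model's distribution unchanged. The function name `verify_token` and the toy distributions are illustrative assumptions, not taken from the paper.

```python
import random

def verify_token(token, p_target, q_draft, rng=random):
    """Standard speculative-sampling check for one drafted token.

    Illustrative only: the paper's own verifier is not described in this summary.
    `p_target` and `q_draft` are probability lists over the same vocabulary.
    """
    p, q = p_target[token], q_draft[token]
    # Accept the drafted token with probability min(1, p/q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return token
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Degenerate case (token had zero draft probability and p == q elsewhere):
        # fall back to sampling from the target directly.
        return rng.choices(range(len(p_target)), weights=p_target)[0]
    return rng.choices(range(len(residual)), weights=residual)[0]

def sample_via_draft(p_target, q_draft, rng):
    """Draw a token from the draft, then verify it against the target."""
    drafted = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    return verify_token(drafted, p_target, q_draft, rng)
```

Drawing a token from the draft and passing it through `verify_token` yields samples distributed according to `p_target`, regardless of how good the draft is; a poor draft only lowers the acceptance rate.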
Abstract
Autoregressive decoding inherently limits the inference throughput of Large Language Models (LLMs) due to its sequential dependency. Speculative decoding mitigates this by verifying multiple predicted tokens in parallel, but its efficiency remains constrained by what we identify as verification heterogeneity: the uneven difficulty of verifying different speculative candidates. In practice, a small subset of high-confidence predictions accounts for most successful verifications, yet existing methods treat all candidates uniformly, leading to redundant computation. We present HeteroSpec, a heterogeneity-adaptive speculative decoding framework that allocates verification effort in proportion to candidate uncertainty. HeteroSpec estimates verification complexity using a lightweight entropy-based quantifier, partitions candidates via a data-driven stratification policy, and dynamically tunes…
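The pipeline described in the abstract (entropy-based quantifier, then stratification, then parameter tuning) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: it uses plain Shannon entropy of a draft distribution as the uncertainty score, and the strata boundaries, the `extra_depth` name, and the depth values are hypothetical placeholders. The paper's actual quantifier, its data-driven thresholds, and the parameters it tunes are not given in this summary.

```python
import math

def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical strata as (entropy upper bound, extra draft depth).
# These numbers are placeholders; the paper derives its partition from data.
STRATA = [(0.5, 3), (1.5, 1), (math.inf, 0)]

def extra_depth(probs, strata=STRATA):
    """Map a draft distribution to an extra speculation budget.

    Low entropy (confident draft) gets a larger budget; high entropy gets none.
    """
    h = shannon_entropy(probs)
    for upper, depth in strata:
        if h <= upper:
            return depth
    return 0

# A peaked distribution lands in the most confident stratum,
# while a flat one over 8 tokens gets no extra budget.
confident = extra_depth([0.97, 0.01, 0.01, 0.01])
uncertain = extra_depth([1 / 8] * 8)
```

The design intent is that verification effort follows uncertainty: candidates that are likely to be accepted are speculated further, and uncertain ones are not expanded.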
Taxonomy
Methods: Pruning
