TL;DR
SpecKV is an adaptive speculation-length controller for speculative decoding in LLMs. At each step it chooses how many tokens the draft model proposes, based on draft-model signals and the compression level of the target model, and it improves on a fixed speculation length by 56.0%.
Contribution
We develop a lightweight adaptive controller that selects the speculation length at each step from draft-model signals, accounting for the compression regime of the target model, to improve decoding efficiency.
Findings
SpecKV achieves a 56.0% improvement over the fixed-speculation-length baseline.
Draft model confidence and entropy strongly predict acceptance rates.
The best speculation length varies across compression levels, so the controller adapts the number of proposed tokens to each regime.
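As an illustration of the kind of draft-model signals these findings refer to, the sketch below computes a confidence and an entropy value from one step of draft logits. It assumes that confidence means the maximum softmax probability and that entropy means the Shannon entropy (in nats) of the draft distribution. These definitions and the function names `softmax` and `draft_signals` are assumptions made for this example, not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def draft_signals(logits):
    """Return (confidence, entropy) for one draft-model step.

    Assumed definitions (hypothetical, for illustration only):
      confidence = probability of the most likely token
      entropy    = Shannon entropy of the token distribution, in nats
    """
    probs = softmax(logits)
    confidence = max(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return confidence, entropy


# A peaked distribution gives high confidence and low entropy;
# a flat one gives low confidence and high entropy.
peaked = draft_signals([10.0, 0.0, 0.0, 0.0])
flat = draft_signals([0.0, 0.0, 0.0, 0.0])
```

Under these assumed definitions, a peaked draft distribution would be expected to go with higher acceptance rates, which is the relationship the findings describe.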
Abstract
Speculative decoding accelerates large language model (LLM) inference by using a small draft model to propose candidate tokens that a larger target model verifies. A critical hyperparameter in this process is the speculation length, which determines how many tokens the draft model proposes per step. Nearly all existing systems use a fixed speculation length (typically 4), yet empirical evidence suggests that the optimal value varies across task types and, crucially, depends on the compression level applied to the target model. In this paper, we present SpecKV, a lightweight adaptive controller that selects the speculation length at each speculation step using signals extracted from the draft model itself. We profile speculative decoding across 4 task categories, 4 speculation lengths, and 3 compression levels (FP16, INT8, NF4), collecting 5,112 step-level records with per-step acceptance rates, draft…
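To see why a fixed speculation length can be suboptimal, here is a minimal sketch of a length selector. It is not the SpecKV controller, whose details are not given in this summary. It assumes each draft token is accepted independently with probability `alpha`, for which the expected number of tokens emitted per step with `k` proposed tokens is (1 - alpha^(k+1)) / (1 - alpha), the standard result from the speculative decoding literature. The candidate set `(2, 4, 6, 8)`, the `cost_ratio` parameter (the cost of one draft pass relative to one target pass), and both function names are hypothetical.

```python
def expected_tokens(alpha, k):
    """Expected tokens emitted per speculation step.

    Assumes each of the k draft tokens is accepted independently with
    probability alpha (a simplifying assumption, not the paper's model).
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def choose_speculation_length(alpha, candidates=(2, 4, 6, 8), cost_ratio=0.1):
    """Pick the candidate length with the best expected tokens per unit cost.

    The cost of one step is modeled as k draft passes plus one target pass,
    with cost_ratio the assumed draft-to-target cost (hypothetical value).
    """
    def tokens_per_cost(k):
        return expected_tokens(alpha, k) / (k * cost_ratio + 1.0)

    return max(candidates, key=tokens_per_cost)


# A low estimated acceptance rate favors short speculation,
# a high one favors long speculation.
short = choose_speculation_length(alpha=0.3)
long_ = choose_speculation_length(alpha=0.95)
```

In this sketch `alpha` would have to be estimated at each step, for example from draft-model signals, and separately for each compression level, since a compressed target model may accept draft tokens at a different rate.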
