KnapSpec: Self-Speculative Decoding via Adaptive Layer Selection as a Knapsack Problem
Seongjin Cha, Gyuwan Kim, Dongsu Han, Tao Yang, Insu Han

TL;DR
KnapSpec introduces an adaptive, training-free decoding framework that formulates draft model selection as a knapsack problem, optimizing inference speed for large language models by dynamically balancing computational costs and accuracy.
Contribution
It reformulates draft model selection in self-speculative decoding as a knapsack problem and provides a theoretical basis for using cosine similarity between hidden states as a proxy for the token acceptance rate.
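To make the proxy concrete, here is a minimal sketch of the cosine similarity between two hidden-state vectors. The names `full_hidden` and `draft_hidden` and their values are made up for illustration. How the paper relates this quantity to the acceptance rate is not reproduced here.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch; undefined in general
    return dot / (norm_u * norm_v)

# Hypothetical hidden states: one from the full model, one from a
# layer-skipping draft. The values are illustrative only.
full_hidden = [1.0, 2.0, 3.0]
draft_hidden = [1.0, 2.0, 2.5]
similarity = cosine_similarity(full_hidden, draft_hidden)
```

A value near 1 means the draft's hidden state points in almost the same direction as the full model's, which is the sense in which it can stand in for drafting faithfulness.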
Findings
Achieves up to 1.47x speedup on benchmarks.
Outperforms existing SSD methods consistently.
Maintains high drafting faithfulness without extra training.
Abstract
Self-speculative decoding (SSD) accelerates LLM inference by skipping layers to create an efficient draft model, yet existing methods often rely on static heuristics that ignore the dynamic computational overhead of attention in long-context scenarios. We propose KnapSpec, a training-free framework that reformulates draft model selection as a knapsack problem to maximize tokens-per-time throughput. By decoupling Attention and MLP layers and modeling their hardware-specific latencies as functions of context length, KnapSpec adaptively identifies optimal draft configurations on the fly via a parallel dynamic programming algorithm. Furthermore, we provide the first rigorous theoretical analysis establishing cosine similarity between hidden states as a mathematically sound proxy for the token acceptance rate. This foundation allows our method to maintain high drafting faithfulness while…
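The knapsack framing can be illustrated with a textbook 0/1 knapsack over Attention and MLP blocks. This is a sketch under stated assumptions, not the paper's algorithm. The latency model in `block_latency`, the additive per-block `scores`, and the latency `budget` are all hypothetical. The paper optimizes tokens-per-time throughput with a parallel dynamic program, whereas this version is sequential and maximizes a made-up score.

```python
def block_latency(kind, context_len):
    """Hypothetical latency in integer microseconds (assumption, not measured)."""
    if kind == "attn":
        return 20 + context_len // 64  # attention cost grows with context length
    return 30                          # MLP cost treated as context-independent

def select_blocks(blocks, scores, context_len, budget):
    """Standard 0/1 knapsack: pick blocks to keep within a latency budget."""
    costs = [block_latency(kind, context_len) for kind in blocks]
    n = len(blocks)
    # best[i][b] = highest total score using the first i blocks with latency <= b
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost, score = costs[i - 1], scores[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip block i-1
            if cost <= b and best[i - 1][b - cost] + score > best[i][b]:
                best[i][b] = best[i - 1][b - cost] + score  # option 2: keep it
    # Walk back through the table to recover which blocks were kept.
    kept, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            kept.append(i - 1)
            b -= costs[i - 1]
    kept.reverse()
    return kept, best[n][budget]

# Illustrative inputs only: two layers, each split into attention and MLP.
blocks = ["attn", "mlp", "attn", "mlp"]
scores = [0.9, 0.4, 0.3, 0.8]
short_kept, short_score = select_blocks(blocks, scores, context_len=512, budget=100)
long_kept, long_score = select_blocks(blocks, scores, context_len=8192, budget=100)
```

With these toy numbers, attention blocks fit in the budget at a short context but become too expensive at a long one, so the selected configuration changes with context length. That is the behavior a static skipping heuristic cannot capture.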
Taxonomy
Topics: Advanced Data Storage Technologies · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization
