Draft-based Approximate Inference for LLMs

Kevin Galim; Ethan Ewer; Wonjun Kang; Minjae Lee; Hyung Il Koo; Kangwook Lee

arXiv:2506.08373·cs.CL·February 3, 2026

TL;DR

This paper introduces a unified framework that uses small draft models to predict token and KV pair importance more accurately, improving the accuracy of approximate inference for long-context LLMs while preserving its efficiency.

Contribution

It proposes novel importance estimation techniques using lookahead with draft models, including SpecKV, SpecPC, and their combination, advancing long-context LLM inference methods.

Findings

1. Achieves higher accuracy than existing methods on long-context benchmarks.
2. Maintains efficiency in memory, latency, and throughput.
3. Provides theoretical and empirical justification for lookahead importance estimation.

Abstract

Optimizing inference for long-context large language models (LLMs) is increasingly important due to the quadratic compute and linear memory cost of Transformers. Existing approximate inference methods, including key-value (KV) cache dropping, sparse attention, and prompt compression, typically rely on coarse predictions of token or KV pair importance. We unify and extend recent work by introducing a framework for approximate LLM inference that leverages small draft models to more accurately predict token and KV pair importance. We provide novel theoretical and empirical analyses justifying lookahead-based importance estimation techniques. Within this framework, we present: (i) SpecKV, the first method to use lookahead with a small draft model to enable precise KV cache dropping; (ii) SpecPC, which leverages draft model attention activations to identify and discard less important prompt…
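The lookahead idea behind KV cache dropping can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head setup, and the choice to aggregate with a maximum over lookahead queries are all assumptions made for illustration. The sketch takes queries that stand in for draft-generated lookahead tokens, scores each prompt KV pair by the attention it receives from them, and keeps the top `budget` pairs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def lookahead_importance(prompt_keys, lookahead_queries):
    """Score each prompt KV pair by the attention it gets from lookahead queries.

    prompt_keys:       (n_prompt, d) keys of the prompt tokens
    lookahead_queries: (n_lookahead, d) queries standing in for draft-generated tokens
    Aggregating with max over queries is an assumption of this sketch.
    """
    d = prompt_keys.shape[-1]
    logits = lookahead_queries @ prompt_keys.T / np.sqrt(d)  # (n_lookahead, n_prompt)
    attn = softmax(logits, axis=-1)
    return attn.max(axis=0)  # (n_prompt,)

def select_kv(prompt_keys, prompt_values, lookahead_queries, budget):
    """Keep the `budget` highest-scoring KV pairs, preserving token order."""
    scores = lookahead_importance(prompt_keys, lookahead_queries)
    keep = np.sort(np.argsort(scores)[-budget:])
    return prompt_keys[keep], prompt_values[keep], keep

if __name__ == "__main__":
    # Toy data: only tokens 2 and 4 align with the two lookahead queries.
    keys = np.zeros((6, 4))
    keys[2, 0] = 5.0
    keys[4, 1] = 5.0
    values = np.arange(24, dtype=float).reshape(6, 4)
    queries = np.array([[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0]])
    _, _, keep = select_kv(keys, values, queries, budget=2)
    print("kept token indices:", keep)
```

In a real system the scores would be computed per layer and per head, and the lookahead queries would come from a draft model's generated continuation; both are outside the scope of this toy example.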

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 4·Confidence 3

Strengths

1. First integration of draft-model lookahead into KV dropping and prompt compression, with theoretical justification. 2. Strong empirical gains across diverse benchmarks, models, and compression budgets. 3. Clear motivation, concise algorithms, and well-presented results.

Weaknesses

1. Lacks analysis of importance score differences with/without lookahead. 2. Limited breakdown of the latency trade-off. 3. Unclear whether SpecKV and SpecPC can be effectively combined.

Reviewer 02·Rating 8·Confidence 3

Strengths

1. By using a draft model to estimate the importance of tokens in the KV cache and prompt, the method achieves strong performance under controllable complexity. 2. This work provides a clear theoretical analysis, demonstrating how embedding errors influence KV importance estimation errors (Theorem 1), and how output approximation under RIP or more general assumptions can upper bound attention approximation errors (Theorems 2 and 3). 3. The experiments on the LongBench and RULER benchmarks are solid,

Weaknesses

1. For different input embeddings, are there any limitations to the applicability of Theorem 2? 2. There appear to be some typos in Table 2.

Reviewer 03·Rating 4·Confidence 3

Strengths

1. This paper provides theoretical (Theorems 1 and 2) and experimental evidence to support the effectiveness of lookahead-based importance estimation. 2. The framework unifies and extends the idea of using approximate future information to improve token importance estimation. 3. The paper claims that its method achieves state-of-the-art accuracy on long-context benchmarks under a fixed KV cache or prompt size. 4. Even if a weak draft model is used, th

Weaknesses

1. The core of the whole framework is to use the draft model to approximate the behavior of the target model. Both theoretical analysis (Theorem 1) and experimental results (Fig. 10) show that the accuracy of the draft model directly affects the final performance. If a draft model that is small (low overhead) and similar enough to the target model (high accuracy) cannot be found, the effect of SpecKV and SpecPC may be compromised. 2. Compared with methods such as SnapKV, which only pre-fills an

Code & Models

Repositories

furiosa-ai/draft-based-approx-llm
pytorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Softmax · Attention Is All You Need