LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Jinwoo Ahn; Ingyu Seong; Akhil Kedia; Junhan Kim; Hyemi Jang; Kangwook Lee; Yongkweon Jeon

arXiv:2603.10899·cs.LG·March 12, 2026

Open Access · 3 Reviews

TL;DR

LookaheadKV introduces a lightweight, parameter-efficient framework for key-value cache eviction in large language models that predicts importance scores accurately without expensive draft generation, improving efficiency and speed.

Contribution

It presents a novel, low-overhead eviction method that enhances cache management in LLMs by accurately predicting importance scores without costly future response generation.

Findings

01. Outperforms recent baselines in long-context understanding tasks.

02. Reduces eviction cost by up to 14.5 times.

03. Achieves faster time-to-first-token in inference.

Abstract

Transformer-based large language models (LLMs) rely on key-value (KV) caching to avoid redundant computation during autoregressive inference. While this mechanism greatly improves efficiency, the cache size grows linearly with the input sequence length, quickly becoming a bottleneck for long-context tasks. Existing solutions mitigate this problem by evicting prompt KV that are deemed unimportant, guided by estimated importance scores. Notably, a recent line of work proposes to improve eviction quality by "glimpsing into the future", in which a draft generator produces a surrogate future response approximating the target model's true response, and this surrogate is subsequently used to estimate the importance of cached KV more accurately. However, these approaches rely on computationally expensive draft generation, which introduces substantial prefilling overhead and limits their…
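As a rough illustration of the two ideas in the abstract (linear cache growth and score-guided eviction), here is a minimal Python sketch. It is not the paper's implementation: the helper names `kv_cache_bytes` and `evict_by_importance`, the model shape, and the importance scores are all assumptions chosen for the example, and how the scores are estimated is left out entirely.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The factor 2 covers the separate key and value tensors.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def evict_by_importance(scores, budget):
    """Return the sorted positions of the `budget` highest-scoring prompt tokens."""
    if budget >= len(scores):
        return list(range(len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the survivors so the kept entries stay in prompt order.
    return sorted(ranked[:budget])


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128, fp16 values.
small = kv_cache_bytes(4_096, 32, 8, 128)
large = kv_cache_bytes(32_768, 32, 8, 128)

# Made-up importance scores for a six-token prompt; keep three cache entries.
kept = evict_by_importance([0.05, 0.30, 0.10, 0.02, 0.40, 0.13], budget=3)
```

Under these assumed dimensions, an 8x longer prompt needs an 8x larger cache, which is the linear growth the abstract refers to. Eviction then amounts to keeping only the positions returned by the selection step.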

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The presentation is clear overall, and the problem is motivated by the accuracy-overhead trade-off of other draft-based methods. The latency problem is tackled directly through a learned approach. LookaheadKV provides an interesting application of parameter-efficient fine-tuning methodology to KV caching.

2. Experiments are performed on various model families and sizes, and on long-context benchmarks. Inclusion of LongProc provides some support that LookaheadKV can preserve the model's reasoni…

Weaknesses

1. All ground-truth attention scores are generated using greedy decoding. As a result, LookaheadKV is effectively trained to predict the attention patterns of a greedy future only, which may limit its applicability. In many practical applications, LLM inference is not deterministic and benefits immensely from stochastic sampling strategies such as temperature scaling to produce more diverse outputs. Attention patterns under sampling can differ substantially from those under greedy decoding, which could lead to degradation in eviction quality…
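To make the reviewer's distinction concrete, the following Python sketch contrasts greedy decoding with temperature-scaled sampling for a single next-token step. It is a generic illustration, not code from the paper; the function names and the toy logits are assumptions.

```python
import math
import random


def greedy_token(logits):
    """Greedy decoding: always pick the highest-logit token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_token(logits, temperature, rng):
    """Temperature sampling: draw from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)
toy_logits = [1.0, 3.0, 2.0]  # made-up logits over a three-token vocabulary

greedy = greedy_token(toy_logits)
# A very low temperature concentrates nearly all probability on the greedy choice.
cold = [sample_token(toy_logits, 0.01, rng) for _ in range(100)]
# A higher temperature spreads probability over the other tokens as well.
warm = [sample_token(toy_logits, 1.5, rng) for _ in range(100)]
```

Because sampled continuations can diverge from the greedy one, the tokens that later attend to the prompt may differ between the two modes, which is the source of the reviewer's concern.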

Reviewer 02 · Rating 4 · Confidence 5

Strengths

1. Strong empirical results with broad coverage: Outperforms baselines on multiple benchmarks.

2. Comprehensive ablations and analysis: Includes detailed experiments on the number of lookahead tokens, LoRA layer coverage, training context length, and budget scaling.

Weaknesses

1. Unclear generalization to stochastic decoding: The method is trained and evaluated primarily under greedy decoding, assuming deterministic next-token prediction. It remains untested under sampling-based decoding.

2. Absence of multi-turn or instruction-following benchmarks: Evaluations mostly use single-turn or synthetic long-context datasets. Multi-turn reasoning or conversational tasks, where future tokens are highly context-dependent, are missing.

3. Lack of Subtask-Level Analysis on LongBen…

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The central idea is interesting: use a pretrained module with a small LoRA applied on a short learnable window to substitute for running an explicit generate step. This reduces prefilling cost while still capturing signals about future attention. The method is simple to integrate, has low inference overhead, and shows consistent gains in low-budget regimes. The empirical results cover several model families and tasks, and the latency accounting is practical.
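The idea of scoring prompt entries from a short window of lookahead positions, instead of from generated draft tokens, can be sketched roughly as below. This is a single-head toy in plain Python under strong assumptions: the lookahead queries are random stand-ins, since the page does not specify how the learned lookahead embeddings are parameterized, and neither the LoRA adaptation nor the training objective is modeled. The name `lookahead_importance` is hypothetical.

```python
import math
import random


def softmax(xs):
    peak = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def lookahead_importance(prompt_keys, lookahead_queries):
    """Average attention weight each prompt key receives from the lookahead queries."""
    dim = len(prompt_keys[0])
    scores = [0.0] * len(prompt_keys)
    for query in lookahead_queries:
        # Scaled dot-product attention logits of one query over all prompt keys.
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in prompt_keys
        ]
        for i, weight in enumerate(softmax(logits)):
            scores[i] += weight / len(lookahead_queries)
    return scores


random.seed(0)
dim, n_prompt, n_lookahead = 8, 16, 4
prompt_keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_prompt)]
# Random placeholders standing in for queries produced at the lookahead positions.
lookahead_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_lookahead)]

importance = lookahead_importance(prompt_keys, lookahead_queries)
```

The resulting per-position scores could then feed a top-budget selection over the prompt cache. The point of the sketch is only that the scores come from one extra attention pass over a few positions rather than from an explicit generation loop.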

Weaknesses

The presentation has gaps. I can understand the LoRA training and the alignment objective, but I do not understand precisely how the lookahead embeddings are obtained. It is not fully clear whether these lookahead tokens are new learned embeddings, adapted from existing vocabulary embeddings, or derived from another module. The paper would benefit from a precise definition of the parameterization, initialization, and update path of these embeddings.

Taxonomy

Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Healthcare