Learning to Evict from Key-Value Cache

Luca Moschella; Laura Manduchi; Ozan Sener

arXiv:2602.10238·cs.CL·February 12, 2026

Open Access 3 Reviews

TL;DR

This paper introduces KV Policy, a reinforcement learning framework that learns to efficiently manage Key-Value cache eviction in large language models, significantly improving performance and generalization over heuristic methods.

Contribution

The paper proposes a novel RL-based approach for cache eviction in LLMs, training lightweight per-head agents to predict token utility without modifying the model or adding inference overhead.

Findings

01

KVP outperforms heuristic baselines on RULER and OASST2-4k benchmarks.

02

KVP generalizes well to downstream tasks and longer contexts.

03

The RL approach effectively predicts future token utility for cache management.

Abstract

The growing size of Large Language Models (LLMs) makes efficient inference challenging, primarily due to the memory demands of the autoregressive Key-Value (KV) cache. Existing eviction or compression methods reduce cost but rely on heuristics, such as recency or past attention scores, which serve only as indirect proxies for a token's future utility and introduce computational overhead. We reframe KV cache eviction as a reinforcement learning (RL) problem: learning to rank tokens by their predicted usefulness for future decoding. To this end, we introduce KV Policy (KVP), a framework of lightweight per-head RL agents trained on pre-computed generation traces using only key and value vectors. Each agent learns a specialized eviction policy guided by future utility, which evaluates the quality of the ranking across all cache budgets, requiring no modifications to the underlying LLM or…
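The idea of ranking tokens once and serving every cache budget from that single ranking can be illustrated with a minimal sketch. The scores below are hypothetical stand-ins for whatever a per-head agent would predict, and the helper names `rank_tokens` and `keep_set` are illustrative rather than taken from the paper.

```python
from typing import List, Sequence


def rank_tokens(scores: Sequence[float]) -> List[int]:
    """Return token positions ordered from most to least useful."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def keep_set(ranking: Sequence[int], budget: int) -> List[int]:
    """Positions retained under a cache budget, in original sequence order."""
    return sorted(ranking[:max(budget, 0)])


# Hypothetical utility scores for a 6-token cache in one attention head.
scores = [0.9, 0.1, 0.4, 0.05, 0.7, 0.3]
ranking = rank_tokens(scores)

# The same ranking answers every budget: smaller keep-sets are nested
# inside larger ones, so no per-budget re-scoring is needed.
kept_2 = keep_set(ranking, 2)
kept_4 = keep_set(ranking, 4)
print(ranking, kept_2, kept_4)
```

Because each keep-set is a prefix of one ranking, a policy of this shape is budget-agnostic by construction; how KVP actually produces the scores is not shown here.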

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 4

Strengths

S1. This paper tackles an important problem: KV cache compression by eviction. S2. It proposes learning to evict KV states, which is less explored in the area. S3. The paper is well written and structured.

Weaknesses

W1. Comparisons with recent sparse KV cache retrieval approaches (e.g., IceCache, ArkVale, MagicPig, InfiniGen) should be included in addition to the KV cache eviction approaches. W2. More backbone LLMs should be included. Qwen2.5 should be upgraded to Qwen3-8B. At least Llama3.1-8B should be included for model variety. A medium-sized LLM (~32B) should also be included to demonstrate the scalability of the proposed approach. W3. More long-context benchma

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The paper provides a principled, budget-agnostic formulation of KV eviction and a practical, lightweight per-head policy that uses only cache-local features. Due to this, the technique is fast, query-free, and easy to deploy. Empirically, it outperforms strong heuristics and attention-aware baselines across cache sizes and tasks, showing robust generalization.

Weaknesses

Training optimizes a proxy based on future attention computed from offline Q/K/V traces. This can misalign with downstream utility and requires precomputing and storing full-sequence Q/K/V (attention matrices omitted only due to size).
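To make the notion of a future-attention proxy concrete, here is a minimal sketch of one way such a signal could be computed from offline query and key traces for a single head. The specific definition used (summing the causal attention each cached token receives from all later queries) is an assumption for illustration; the source does not spell out the paper's exact training target, and the function names are hypothetical.

```python
import math
from typing import List


def causal_attention(queries: List[List[float]], keys: List[List[float]]) -> List[List[float]]:
    """Causal softmax attention weights; row i attends to positions 0..i."""
    n, d = len(queries), len(queries[0])
    rows = []
    for i in range(n):
        logits = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return rows


def future_attention(attn: List[List[float]]) -> List[float]:
    """Assumed proxy: attention mass token j receives from strictly later queries."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(j + 1, n)) for j in range(n)]


# Hypothetical 3-token trace with 2-dimensional queries and keys.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
attn = causal_attention(queries, keys)
utility = future_attention(attn)
print(utility)
```

The sketch also shows why the reviewer flags storage cost: the proxy needs the queries and keys of the entire sequence, since a token's score depends on every query that comes after it.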

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The formulation of KV cache eviction as a learning problem is original. The authors prove that, under two reasonable assumptions, the subset selection problem can be reduced to a ranking problem, which can then be formulated as an RL problem. It is interesting that the policy requires only the keys, values, and their positions as input and no attention information. It is a strength of KVP that it can be pretrained and does not incur any overhead at inference time. I appreciate the ablation stud

Weaknesses

The experiments show that KVP achieves the best accuracy or perplexity on RULER and OASST2 for most cache sizes (Fig. 2) and competitive accuracy on the downstream tasks BOOLQ and ARC CHALLENGE (Fig. 3). However, there is typically a tradeoff between accuracy on the one hand and latency and storage space on the other. Therefore, the authors should also report the latency and storage space of the various tested methods. The authors have only performed experiments with a version of Qwen, and I would like to see whether their r

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare