Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Ngoc Bui; Hieu Trung Nguyen; Arman Cohan; Rex Ying

arXiv:2605.09649 · cs.LG · May 12, 2026

TL;DR

This paper proposes a learnable, global KV cache eviction method that selectively retains useful tokens, reducing memory usage and improving long-context reasoning in language and vision-language tasks.

Contribution

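The paper introduces a retention-based eviction mechanism: lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates those scores across layers and heads. The aim is to improve long-context inference, not merely to compress the cache.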

Findings

01 Substantially reduces KV memory in long-context tasks.

02 Matches or surpasses full-cache inference performance.

03 Improves reasoning by retaining relevant tokens more effectively.
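
One way to see how eviction could match or surpass the full cache is to look at how softmax attention spreads its weight. The sketch below is a toy illustration, not the paper's experiment: the function names and the logit values are assumptions chosen for clarity. It measures how much attention a single useful token receives as the number of irrelevant tokens grows.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of attention logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weight_on_evidence(evidence_logit: float,
                       distractor_logit: float,
                       n_distractors: int) -> float:
    """Attention weight on one useful token among n irrelevant ones.

    Hypothetical setup: every distractor shares the same (lower) logit.
    """
    logits = [evidence_logit] + [distractor_logit] * n_distractors
    return softmax(logits)[0]


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, round(weight_on_evidence(4.0, 0.0, n), 3))
```

With these assumed logits, the useful token holds about 0.85 of the attention mass among 10 distractors and only about 0.05 among 1000, even though its own logit never changes. Evicting distractors restores that weight, which is the intuition behind the third finding.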

Abstract

The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities…
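
The global policy described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each retention gate emits a small feature vector per cached entry, that the shared projection is a single weight vector applied to every entry, and that eviction keeps the top-scoring entries under one budget. The names `EntryId`, `shared_score`, and `global_evict` are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

# Hypothetical identifier for one cached KV entry: (layer, head, token position).
EntryId = Tuple[int, int, int]


def shared_score(gate_features: Sequence[float],
                 shared_weights: Sequence[float]) -> float:
    """Assumed form of the shared projection: one dot product whose weights
    are reused for every layer and head, so scores are directly comparable."""
    return sum(f * w for f, w in zip(gate_features, shared_weights))


def global_evict(gate_outputs: Dict[EntryId, Sequence[float]],
                 shared_weights: Sequence[float],
                 budget: int) -> List[EntryId]:
    """Keep the `budget` highest-scoring entries across ALL layers and heads.

    There is no per-layer or per-head quota: one ranking decides everything.
    """
    scores = {eid: shared_score(feats, shared_weights)
              for eid, feats in gate_outputs.items()}
    kept = heapq.nlargest(budget, scores, key=scores.get)
    return sorted(kept)


if __name__ == "__main__":
    gate_outputs = {
        (0, 0, 0): [2.0, 0.1],
        (0, 0, 1): [-1.0, 0.0],
        (0, 1, 0): [0.5, 0.5],
        (1, 0, 0): [3.0, -0.2],
        (1, 0, 1): [-2.0, 0.3],
    }
    print(global_evict(gate_outputs, shared_weights=[1.0, 1.0], budget=3))
```

In this toy cache, a budget of 3 keeps two entries from layer 0 and one from layer 1. That uneven split is the point of a unified budget: memory goes wherever the calibrated scores say it is most useful, instead of being divided evenly among layers and heads.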

Code & Models

Repositories

ngocbh/trimkv (GitHub)
