TL;DR
This paper proposes a learnable, global KV cache eviction method that selectively retains useful tokens, reducing memory usage and improving long-context reasoning in language and vision-language tasks.
Contribution
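It introduces a retention-based eviction mechanism: lightweight gates score cached KV entries, and a shared scoring projection makes those scores comparable across layers and heads. Eviction is thereby treated as a way to improve long-context inference, not only as cache compression.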
Findings
- Substantially reduces KV cache memory in long-context tasks.
- Matches or surpasses the performance of full-cache inference.
- Improves reasoning by retaining relevant tokens and evicting those that dilute attention.
Abstract
The key-value (KV) cache is a major bottleneck in long-context inference, where memory and computation grow with sequence length. Existing KV eviction methods reduce this cost but typically degrade performance relative to full-cache inference. Our key insight is that full-cache attention is not always optimal: in long contexts, irrelevant tokens can dilute attention away from useful evidence, so selective, learnable eviction can improve generation rather than merely approximate the full cache. We introduce a global retention-based KV eviction method that learns each token's future utility under a unified memory budget. Lightweight retention gates assign utility scores to cached KV entries, and a shared final scoring projection calibrates these scores across all layers and heads. This enables a single global eviction policy in which tokens from different layers, heads, and modalities…
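The idea of a single global eviction policy under a unified memory budget can be illustrated with a minimal sketch. It assumes the utility scores have already been calibrated so that they are comparable across layers and heads. The function name `global_evict`, the `(layer, head)` keying, and the hand-picked scores are hypothetical stand-ins for illustration, not the paper's implementation or its learned gate outputs.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical indexing: one score list per (layer, head) pair,
# with one score per cached token position.
Key = Tuple[int, int]


def global_evict(scores: Dict[Key, List[float]], budget: int) -> Dict[Key, List[int]]:
    """Return the token positions to keep in each (layer, head) cache.

    All entries compete in one pool: the `budget` highest-scoring entries
    are kept regardless of which layer or head they belong to.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")

    # Flatten every cached entry into one list of (score, layer, head, position).
    flat = [
        (score, layer, head, pos)
        for (layer, head), row in scores.items()
        for pos, score in enumerate(row)
    ]

    # Global top-k selection. Ties fall back to tuple comparison on
    # (layer, head, position), which is an arbitrary choice for this sketch.
    kept = heapq.nlargest(budget, flat)

    result: Dict[Key, List[int]] = {key: [] for key in scores}
    for _, layer, head, pos in kept:
        result[(layer, head)].append(pos)
    for positions in result.values():
        positions.sort()
    return result


# Illustrative, hand-picked scores (assumed already calibrated).
example_scores = {
    (0, 0): [0.9, 0.1, 0.4],
    (0, 1): [0.2, 0.8, 0.3],
    (1, 0): [0.7, 0.05, 0.6],
}
print(global_evict(example_scores, budget=4))
```

With a budget of 4, head `(1, 0)` keeps two entries while the other two heads keep one each. A per-head budget would instead force every head to keep the same number of entries, even when some of them score low.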
Code & Models
- 🤗 ngocbh/TrimKV-Qwen3-4B-Math (model · 48 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-8B-Math (model · 42 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-14B-Math (model · 33 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-1.7B-Math (model · 35 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-4B-Instruct-2507 (model · 31 downloads)
- 🤗 ngocbh/TrimKV-Phi-3-mini-128k-instruct (model · 34 downloads)
- 🤗 ngocbh/DBTrimKV-Qwen3-4B-Math (model · 56 downloads)
- 🤗 ngocbh/DBTrimKV-Qwen3-4B-Instruct-2507 (model · 40 downloads)
- 🤗 ngocbh/DBTrimKV-Qwen3-VL-8B-Thinking (model · 129 downloads)
- 🤗 ngocbh/DBTrimKV-Qwen3-VL-4B-Instruct (model · 68 downloads)
