Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Ngoc Bui; Shubham Sharma; Simran Lamba; Saumitra Mishra; Rex Ying

arXiv:2512.03324·cs.LG·March 3, 2026



TL;DR

TRIM-KV introduces a learnable token retention mechanism that efficiently manages memory in long-horizon LLM inference, outperforming existing methods and providing interpretability insights.

Contribution

The paper proposes a lightweight, trainable retention gate that scores token importance, improving memory management and model performance in long-context tasks.

Findings

1. Outperforms strong eviction baselines across multiple benchmarks.
2. Surpasses full-cache models in low-memory regimes.
3. Aligns with human intuition and reveals interpretability insights.

Abstract

Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring…
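The eviction rule described in the abstract can be illustrated with a small sketch. This is a toy, single-head version under stated assumptions, not the authors' implementation: the class `RetentionCache`, the function `run_demo`, the per-token gate value `beta`, and the decay form `beta ** age` are all hypothetical choices made here for illustration. The abstract only says that the score is a scalar that decays over time and that low-scoring tokens are evicted once the budget is exceeded.

```python
from dataclasses import dataclass, field


@dataclass
class RetentionCache:
    """Toy single-head cache that holds at most `budget` tokens."""
    budget: int
    # Each entry is (token_id, creation_step, beta), with beta in (0, 1).
    entries: list = field(default_factory=list)

    def score(self, entry, now):
        """Retention score at step `now` (assumed form: beta ** age)."""
        _, created, beta = entry
        return beta ** (now - created)

    def add(self, token_id, now, beta):
        """Insert a token; if over budget, evict the lowest-scoring entry."""
        self.entries.append((token_id, now, beta))
        evicted = None
        if len(self.entries) > self.budget:
            victim = min(self.entries, key=lambda e: self.score(e, now))
            self.entries.remove(victim)
            evicted = victim[0]
        return evicted

    def token_ids(self):
        return [e[0] for e in self.entries]


def run_demo():
    """Feed a short stream with made-up gate values through the cache."""
    cache = RetentionCache(budget=3)
    # Hypothetical gate outputs: a beta near 1 means "useful for a long time".
    stream = [("sys", 0.99), ("the", 0.50), ("x=7", 0.95), ("um", 0.30), ("so", 0.60)]
    evicted = []
    for step, (tok, beta) in enumerate(stream):
        out = cache.add(tok, step, beta)
        if out is not None:
            evicted.append(out)
    return cache.token_ids(), evicted
```

In this toy run the filler tokens "the" and "um" are evicted, while "sys" and "x=7" survive even though they are older. Eviction here depends on the predicted score fixed at creation time and the token's age, not on recent attention weights. A real system would keep one such score per layer and head, as the abstract states.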

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper introduces a trainable approach to KV cache eviction that learns token-level importance scores through retention gates. This design allows the model to identify and retain intrinsically important tokens.
2. The authors conduct experiments across multiple mathematical reasoning tasks and show competitive results.

Weaknesses

1. The paper lacks a comparison with Locret (“Locret: Enhancing eviction in long-context LLM inference with trained retaining heads”), a highly relevant recent work that also computes importance scores for each token and discards low-importance tokens to reduce memory overhead.
2. The paper would benefit from comprehensive evaluation on established long-context understanding benchmarks such as RULER and LongBench-V2 to validate the method's effectiveness.
3. The use of exponential decay for comput…

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. Novel idea, fits cleanly into existing models, and works well without heavy changes.
2. Shows solid gains across tasks, occasionally even beating full-cache runs.
3. Lightweight enough to feel practical, not just academic.

Weaknesses

Since it’s still a trained model, it raises the natural question of why this isn’t folded into normal model training, and the paper doesn’t address that integration path. It also doesn’t discuss how the learned decay interacts with positional encoding, which leaves it unclear whether the gate is learning true importance or just reinforcing a recency-style bias.

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. Novel and well-motivated idea: The paper identifies a fundamental limitation of attention-based eviction (that “recent attention ≠ importance”) and replaces it with a predictive, intrinsic importance estimation per token. This shift from reactive to proactive cache management is conceptually elegant and well-justified.
2. Brain-inspired design: Modeling token importance decay via an exponential forgetting curve connects the method to cognitive science (Ebbinghaus), offering both theoretical…

Weaknesses

Limited exploration of dynamic budgets: The current method assumes a fixed memory budget M. It would be interesting to explore adaptive budgets (e.g., varying by layer, head, or task), especially since the authors mention this as future work.

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Machine Learning in Materials Science