Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying

TL;DR
TRIM-KV introduces a learnable token retention mechanism that efficiently manages memory in long-horizon LLM inference, outperforming existing methods and providing interpretability insights.
Contribution
The paper proposes a lightweight, trainable retention gate that scores token importance, improving memory management and model performance on long-context tasks.
Findings
TRIM-KV outperforms strong eviction baselines across multiple benchmarks.
It surpasses full-cache models in low-memory regimes.
Its retention scores align with human intuition and offer interpretability insights.
Abstract
Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring…
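The eviction rule described in the abstract can be illustrated with a toy, single-head sketch. This is not the authors' implementation: the class and function names are hypothetical, the exponential form `beta ** age` is an assumption (the reviews mention exponential decay, but the exact formula is not given here), and the per-token `beta` values are supplied by hand instead of being predicted by a trained retention gate.

```python
from dataclasses import dataclass


@dataclass
class CachedToken:
    position: int  # decoding step at which the token was created
    beta: float    # hypothetical retention score in (0, 1), fixed at creation time


def retention(token: CachedToken, now: int) -> float:
    """Assumed decay rule: the score shrinks exponentially with the token's age."""
    return token.beta ** (now - token.position)


class BoundedKVCache:
    """Toy cache for one layer and head that holds at most `budget` tokens."""

    def __init__(self, budget: int) -> None:
        if budget < 1:
            raise ValueError("budget must be at least 1")
        self.budget = budget
        self.tokens: list[CachedToken] = []

    def append(self, position: int, beta: float):
        """Add a token; if over budget, evict the lowest-retention token and return it."""
        self.tokens.append(CachedToken(position, beta))
        if len(self.tokens) <= self.budget:
            return None
        evicted = min(self.tokens, key=lambda t: retention(t, position))
        self.tokens.remove(evicted)
        return evicted

    def positions(self) -> list[int]:
        return [t.position for t in self.tokens]


# Usage: five tokens, budget of three. Hand-picked betas stand in for gate outputs.
cache = BoundedKVCache(budget=3)
for step, beta in enumerate([0.99, 0.5, 0.9, 0.6, 0.95]):
    cache.append(step, beta)
print(cache.positions())  # [0, 2, 4]
```

In this run the tokens with low `beta` (positions 1 and 3) are dropped even though older tokens remain, which is the behavior that separates a learned retention score from a pure recency window. A real system would apply this per layer and per head over tensors of keys and values.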
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper introduces a trainable approach to KV cache eviction that learns token-level importance scores through retention gates. This design allows the model to identify and retain intrinsically important tokens.
2. The authors conduct experiments across multiple mathematical reasoning tasks and show competitive results.
1. The paper lacks comparison with LocRet, a highly relevant recent work that also computes importance scores for each token and discards low-importance tokens to reduce memory overhead ("Locret: Enhancing eviction in long-context LLM inference with trained retaining heads").
2. The paper would benefit from comprehensive evaluation on established long-context understanding benchmarks such as RULER and LongBench-V2 to validate the method's effectiveness.
3. The use of exponential decay for comput
1. Novel idea, fits cleanly into existing models, and works well without heavy changes.
2. Shows solid gains across tasks, occasionally even beating full-cache runs.
3. Lightweight enough to feel practical, not just academic.
Since it’s still a trained model, it raises the natural question of why this isn’t folded into normal model training, and the paper doesn’t address that integration path. It also doesn’t discuss how the learned decay interacts with positional encoding, which leaves some ambiguity around whether it’s learning true importance or just reinforcing a recency-style bias.
1. Novel and well-motivated idea: The paper identifies a fundamental limitation of attention-based eviction, namely that “recent attention ≠ importance”, and replaces it with a predictive, intrinsic importance estimation per token. This shift from reactive to proactive cache management is conceptually elegant and well-justified.
2. Brain-inspired design: Modeling token importance decay via an exponential forgetting curve connects the method to cognitive science (Ebbinghaus), offering both theoretical
Limited exploration of dynamic budgets: The current method assumes a fixed memory budget M. It would be interesting to explore adaptive budgets (e.g., varying by layer, head, or task), especially since the authors mention this as future work.
Code & Models
- 🤗 ngocbh/TrimKV-Qwen3-4B-Math (model, 2 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-8B-Math (model, 3 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-14B-Math (model, 4 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-1.7B-Math (model, 8 downloads)
- 🤗 ngocbh/TrimKV-Qwen3-4B-Instruct-2507 (model, 2 downloads)
- 🤗 ngocbh/TrimKV-Phi-3-mini-128k-instruct (model, 3 downloads)
- 🤗 ngocbh/TrimKV-DeepSeek-R1-Distill-Llama-8B (model, 3 downloads)
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Machine Learning in Materials Science
