Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Ziyao Tang; Pengkun Jiao; Xinhang Chen; Wei Liu; Shiyong Li; Jingjing Chen

arXiv:2602.08585·cs.LG·February 10, 2026

Open Access

TL;DR

This paper introduces LU-KV, a KV cache eviction framework that allocates cache budget across attention heads according to long-term utility, reducing KV cache size by 80% with minimal performance degradation while lowering inference latency and GPU memory footprint.

Contribution

The paper proposes LU-KV, a new head-level budget allocation method for KV cache eviction based on marginal utility, improving efficiency over heuristic approaches.

Findings

01. Achieves 80% reduction in KV cache size

02. Reduces inference latency and GPU memory footprint

03. Maintains model performance with minimal degradation

Abstract

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Based on this insight, we propose LU-KV, a novel framework that optimizes head-level budget allocation through a convex-hull relaxation and a marginal-utility-based greedy solver to achieve near-optimal precision. Furthermore, we implement a…
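
To make the allocation idea in the abstract concrete, here is a minimal Python sketch of a marginal-utility greedy solver run on a convex-hull (upper concave envelope) relaxation. It is an illustration under stated assumptions, not the authors' implementation: the per-head utility curves, the function names `concave_envelope` and `allocate_budget`, and the unit-sized budget steps are all hypothetical, and the abstract does not say how LU-KV estimates utility.

```python
import heapq


def concave_envelope(u):
    """Upper concave envelope of the points (b, u[b]), evaluated at each integer b.

    `u` is a hypothetical utility curve for one head: u[b] is the utility of
    keeping b cache units. The envelope is the convex-hull relaxation of it.
    """
    hull = []  # indices of the envelope's vertices, left to right
    for b in range(len(u)):
        while len(hull) >= 2:
            a, m = hull[-2], hull[-1]
            # Drop m when it lies on or below the chord from a to b.
            if (u[m] - u[a]) * (b - a) <= (u[b] - u[a]) * (m - a):
                hull.pop()
            else:
                break
        hull.append(b)
    env = [float(x) for x in u]
    for a, c in zip(hull, hull[1:]):
        for b in range(a, c + 1):
            # Linear interpolation between consecutive vertices.
            env[b] = u[a] + (u[c] - u[a]) * (b - a) / (c - a)
    return env


def allocate_budget(utilities, total_budget):
    """Greedily give one unit at a time to the head with the largest marginal gain.

    On concave envelopes the marginal gains of each head are non-increasing,
    so this greedy choice maximizes the summed relaxed utility.
    """
    envs = [concave_envelope(u) for u in utilities]
    alloc = [0] * len(envs)
    heap = []  # max-heap via negated gains: (-gain, head index)
    for h, env in enumerate(envs):
        if len(env) > 1:
            heapq.heappush(heap, (-(env[1] - env[0]), h))
    remaining = total_budget
    while remaining > 0 and heap:
        _, h = heapq.heappop(heap)
        alloc[h] += 1
        remaining -= 1
        nxt = alloc[h] + 1
        if nxt < len(envs[h]):
            heapq.heappush(heap, (-(envs[h][nxt] - envs[h][alloc[h]]), h))
    return alloc
```

For example, `allocate_budget([[0, 5, 6, 6.5], [0, 0.5, 1.0, 1.5], [0, 4, 7, 7.5]], 4)` spends the four units on the two heads whose curves rise fastest and gives none to the flat one. One caveat of working on the relaxation: if the greedy stops at a point that is not a vertex of a head's envelope, the true utility there can be lower than the relaxed value, which is why such schemes are described as near-optimal rather than exact.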

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Data Storage Technologies · Big Data and Digital Economy