Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
Tho Mai, Joo-Young Kim

TL;DR
This paper introduces LaProx, a new method for KV cache eviction in long-context LLM inference that models token importance more accurately, enabling significant cache reduction with minimal performance loss.
Contribution
It reformulates KV cache eviction as an output-aware, layer-wise matrix approximation problem and proposes a unified, globally comparable importance scoring strategy.
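One way to read "output-aware, layer-wise matrix approximation" is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $A^{h}$ is the attention map of head $h$, $V^{h}$ its value states, $W_O^{h}$ the slice of the output projection belonging to that head, and $S$ the set of tokens kept in the cache. The first line is the standard multi-head attention identity; the second line is a hypothetical eviction objective built on it.

$$
O \;=\; \sum_{h=1}^{H} A^{h} V^{h} W_O^{h} \;=\; \sum_{t=1}^{T} \sum_{h=1}^{H} A^{h}_{:,t}\,\bigl(V^{h}_{t,:} W_O^{h}\bigr)
$$

$$
S^{\star} \;=\; \arg\min_{|S| \le B} \Bigl\lVert\, O - \sum_{t \in S} \sum_{h=1}^{H} A^{h}_{:,t}\,\bigl(V^{h}_{t,:} W_O^{h}\bigr) \Bigr\rVert_F
$$

Under this reading, a token matters to the extent that removing its term changes the layer output, not merely to the extent that it receives attention weight.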
Findings
Maintains model performance while retaining only 5% of the KV cache.
Outperforms prior methods across 19 datasets on LongBench and Needle-In-A-Haystack.
Reduces accuracy loss by up to 2× under extreme cache compression.
Abstract
Large language models (LLMs) support long-context inference but suffer from substantial memory and runtime overhead due to Key-Value (KV) Cache growth. Existing KV Cache eviction methods primarily rely on local attention weights, neglecting the influence of value representations, output projection, and inter-head interactions. In this work, we reformulate KV Cache eviction from a conventional head-wise, weight-averaging approach into an output-aware, layer-wise matrix multiplication approximation problem. We introduce LaProx, a novel eviction strategy that explicitly models the multiplicative interaction between attention maps and projected value states to accurately quantify token contributions while accounting for inter-head dependencies. Building on this metric, we propose the first unified eviction strategy that assigns globally comparable importance scores to tokens, enabling…
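The contrast the abstract draws between attention-weight scoring and output-aware scoring can be illustrated with a toy example. The sketch below is not LaProx itself; the summary above does not give the exact scoring formula. It assumes a single query step, scores each token by the norm of its summed contribution to the layer output across heads, and compares that with a plain head-averaged attention weight. All function names, shapes, and numbers are hypothetical.

```python
import math

def matvec_row(v, W):
    """Row vector v (length d_in) times matrix W (d_in x d_out)."""
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def attention_only_scores(A):
    """Baseline: average attention weight per token across heads."""
    n_heads, n_tokens = len(A), len(A[0])
    return [sum(A[h][t] for h in range(n_heads)) / n_heads for t in range(n_tokens)]

def output_aware_scores(A, V, W_o):
    """Illustrative score: norm of sum_h A[h][t] * (V[h][t] @ W_o[h]).

    Summing over heads before taking the norm lets contributions from
    different heads reinforce or cancel each other.
    """
    n_heads, n_tokens = len(A), len(A[0])
    d_model = len(W_o[0][0])
    scores = []
    for t in range(n_tokens):
        contrib = [0.0] * d_model
        for h in range(n_heads):
            projected = matvec_row(V[h][t], W_o[h])
            for j in range(d_model):
                contrib[j] += A[h][t] * projected[j]
        scores.append(math.sqrt(sum(c * c for c in contrib)))
    return scores

def keep_top_k(scores, k):
    """Indices of the k highest-scoring tokens, in position order."""
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:k])

# Toy data (made up): 2 heads, 3 cached tokens, head_dim = d_model = 2.
A = [[0.6, 0.3, 0.1],          # attention of the current query, head 0
     [0.5, 0.3, 0.2]]          # head 1
V = [[[0.1, 0.0], [1.0, 0.0], [0.0, 4.0]],    # value states, head 0
     [[0.0, 0.1], [-1.0, 0.0], [0.0, 2.0]]]   # head 1
identity = [[1.0, 0.0], [0.0, 1.0]]
W_o = [identity, identity]     # per-head output projection slices

kept_by_attention = keep_top_k(attention_only_scores(A), 2)
kept_by_output = keep_top_k(output_aware_scores(A, V, W_o), 2)
print(kept_by_attention)  # -> [0, 1]
print(kept_by_output)     # -> [0, 2]
```

In this toy case token 1 receives more attention than token 2, so the attention-only baseline keeps it. Its projected values point in opposite directions in the two heads and cancel in the layer output, while token 2 has a low attention weight but a large projected value, so the output-aware score keeps token 2 instead.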
