Minimal-Intervention KV Retention via Set-Conditioned Diversity
Libo Sun, Po-wei Harn, Peixiong He, Xiao Qin

TL;DR
This paper introduces alpha, a simple modification to KV-cache retention scoring that outperforms more complex redesigns on small-budget mathematical reasoning tasks.
Contribution
The paper proposes alpha, a one-function change to the TriAttention retention scorer, and shows that it outperforms structural redesigns of KV-cache compression for reasoning models at small budgets.
Findings
Alpha outperforms heavier redesigns at small budgets.
Alpha clears Bonferroni correction in two model/budget settings.
Minimal scoring modification beats structural redesigns in this regime.
Abstract
KV-cache compression at small budgets is a crowded design space spanning cache representation, head-wise routing, compression cadence, decoding behavior, and within-budget scoring. We study seven mechanisms across these five families under matched mean cache on long-form mathematical reasoning (MATH-500~\cite{hendrycks2021math}) with two distilled-reasoning models (Qwen-7B and Llama-8B variants of DeepSeek-R1-Distill~\cite{deepseek2025r1}) at budgets . All seven were rejected. We then propose , a one-function modification to the TriAttention~\cite{mao2026triattention} retention scorer that replaces argmax-top- with greedy facility-location-inspired selection under a V-space redundancy penalty controlled by a single weight . A pre-registered protocol tunes on a frozen development split and confirms on a disjoint held-out split; with…
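The selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact objective, so everything specific here is an assumption made for illustration: the function name `greedy_diverse_select`, cosine similarity as the V-space redundancy measure, and the penalty taken as the maximum similarity to the already-kept set. It is not the authors' implementation, only one plausible reading of "greedy selection under a redundancy penalty controlled by a single weight".

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def greedy_diverse_select(scores, values, budget, weight):
    """Greedily keep up to `budget` cache positions (hypothetical helper).

    scores: per-position retention scores from some base scorer.
    values: per-position value vectors (the assumed "V-space").
    weight: strength of the redundancy penalty (assumed form).

    Each step picks the position maximizing
        scores[i] - weight * max_{j in kept} cosine(values[i], values[j]).
    The penalty starts at 0.0, so negative similarity counts as no redundancy.
    With weight == 0 this reduces to plain top-k by score.
    """
    n = len(scores)
    budget = min(budget, n)
    kept = []
    max_sim = [0.0] * n  # redundancy of each position against the kept set
    remaining = set(range(n))
    for _ in range(budget):
        # sorted() makes tie-breaking deterministic: lowest index wins.
        best = max(sorted(remaining), key=lambda i: scores[i] - weight * max_sim[i])
        kept.append(best)
        remaining.remove(best)
        for i in remaining:
            max_sim[i] = max(max_sim[i], cosine(values[i], values[best]))
    return kept


# Toy example: positions 0 and 1 carry identical value vectors.
toy_scores = [1.0, 0.9, 0.5]
toy_values = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(greedy_diverse_select(toy_scores, toy_values, budget=2, weight=0.0))
print(greedy_diverse_select(toy_scores, toy_values, budget=2, weight=1.0))
```

In the toy example, a zero weight keeps the two highest-scoring positions even though their value vectors are identical, while a weight of 1.0 swaps the duplicate for the lower-scoring but non-redundant position. That is the intuition behind penalizing redundancy within a fixed budget.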
