Towards Threshold-Free KV Cache Pruning

Xuanfan Ni; Liyan Xu; Chenyang Lyu; Longyue Wang; Mo Yu; Lemao Liu; Fandong Meng; Jie Zhou; Piji Li

arXiv:2502.16886·cs.CL·January 7, 2026

TL;DR

This paper introduces ReFreeKV, a novel method for KV cache pruning in large language models that automatically adjusts cache budgets without relying on pre-set thresholds, ensuring robust performance across diverse datasets.

Contribution

The paper proposes a threshold-free KV cache pruning method, ReFreeKV, which adapts to input variability and eliminates the need for domain-specific tuning.

Findings

01. ReFreeKV outperforms existing methods on 13 diverse datasets.

02. It maintains full-cache performance without pre-defined budget thresholds.

03. The approach is effective across various context lengths, tasks, and model sizes.

Abstract

To reduce memory consumption during LLM inference, prior works have proposed numerous KV cache pruning methods based on various criteria. While these techniques often achieve lossless memory reduction on many datasets, they rely on an under-emphasized condition: a dataset- or domain-specific budget threshold must be pre-determined to achieve optimal performance. However, such input-specific tuning may be considerably limited in real-world scenarios, as open-domain inputs span diverse domains, lengths, and difficulty levels, without clear boundaries for pre-tuning. Thus, the dependence on an input-sensitive threshold can be an inherent limitation that may cause large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraints for robust KV pruning, calling for "threshold-free" methods that automatically…
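To make the threshold dependence concrete, here is a minimal sketch of the fixed-budget style of pruning that the abstract attributes to prior work. It is not ReFreeKV and not any specific published method: the function name `prune_kv_fixed_budget`, the per-token importance scores, and the toy inputs are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def prune_kv_fixed_budget(scores: Sequence[float], budget: int) -> List[int]:
    """Return the positions of cached tokens to keep, in original order.

    `scores` is a hypothetical per-token importance (for example, an
    accumulated attention weight); `budget` is the pre-set number of
    cache entries to retain.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if budget >= len(scores):
        return list(range(len(scores)))
    # Rank positions by score (highest first); ties go to the earlier position.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:budget])


# Two toy inputs with different importance distributions (assumed values).
short_input = [0.50, 0.05, 0.40, 0.05]
long_input = [0.20, 0.05, 0.20, 0.05, 0.20, 0.05, 0.20, 0.05]

kept_short = prune_kv_fixed_budget(short_input, budget=2)
kept_long = prune_kv_fixed_budget(long_input, budget=2)

# Fraction of total importance retained under the same budget.
mass_short = sum(short_input[i] for i in kept_short) / sum(short_input)
mass_long = sum(long_input[i] for i in kept_long) / sum(long_input)
```

With the same budget of 2, the first toy input retains most of its importance mass while the second retains less than half, which is the kind of input sensitivity a single pre-tuned budget cannot account for.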

Taxonomy

Methods: Softmax · Attention Is All You Need · Pruning