Towards Threshold-Free KV Cache Pruning
Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li

TL;DR
This paper introduces ReFreeKV, a novel method for KV cache pruning in large language models that automatically adjusts cache budgets without relying on pre-set thresholds, ensuring robust performance across diverse datasets.
Contribution
The paper proposes a threshold-free KV cache pruning method, ReFreeKV, which adapts to input variability and eliminates the need for domain-specific tuning.
Findings
ReFreeKV outperforms existing methods on 13 diverse datasets.
It maintains full-cache performance without pre-defined budget thresholds.
The approach is effective across various context lengths, tasks, and model sizes.
Abstract
To reduce memory consumption during LLM inference, prior works have proposed numerous methods that focus on KV cache pruning based on various criteria. While these techniques often accomplish lossless memory reduction on many datasets, they typically rely on an under-emphasized condition: a dataset/domain-specific budget size threshold needs to be pre-determined to achieve the optimal performance. However, such input-specific tuning may be considerably limited in real-world scenarios, as open-domain inputs span diverse domains, lengths and difficulty levels, without clear boundaries for pre-tuning. Thus, the dependence on an input-sensitive threshold can be an inherent limitation that may cause large degradation on arbitrary inputs. In this work, we propose a new objective that lifts the threshold constraints for robust KV pruning, calling for "threshold-free" methods that automatically…
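To make the "budget size threshold" mentioned in the abstract concrete, the following is a minimal sketch of a generic fixed-budget pruning baseline, not ReFreeKV itself. The function name `prune_kv_fixed_budget`, the per-position importance scores, and the toy data are all assumptions introduced for illustration; the source does not describe any specific scoring rule.

```python
from typing import List, Sequence, Tuple


def prune_kv_fixed_budget(
    keys: Sequence[Sequence[float]],
    values: Sequence[Sequence[float]],
    scores: Sequence[float],
    budget: int,
) -> Tuple[List[Sequence[float]], List[Sequence[float]], List[int]]:
    """Keep the `budget` cached positions with the highest importance scores.

    Hypothetical baseline: `scores` stands in for any per-position importance
    signal (for example, accumulated attention mass). `budget` is the pre-set
    threshold that has to be chosen before the input is seen.
    """
    n = len(keys)
    if len(values) != n or len(scores) != n:
        raise ValueError("keys, values and scores must have the same length")
    if budget <= 0:
        raise ValueError("budget must be a positive integer")

    if budget >= n:
        keep = list(range(n))  # nothing to prune
    else:
        # Rank positions by score, take the top `budget`,
        # then restore the original token order.
        ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
        keep = sorted(ranked[:budget])

    return [keys[i] for i in keep], [values[i] for i in keep], keep


# Toy cache with 8 positions and 2-dimensional key/value vectors.
toy_keys = [[float(i), 0.0] for i in range(8)]
toy_values = [[0.0, float(i)] for i in range(8)]
toy_scores = [0.30, 0.02, 0.01, 0.25, 0.03, 0.04, 0.15, 0.20]

pruned_keys, pruned_values, kept = prune_kv_fixed_budget(
    toy_keys, toy_values, toy_scores, budget=4
)
print(kept)  # [0, 3, 6, 7]
```

In this sketch the same `budget` is applied whatever the input looks like, which is the dependence the abstract identifies as a limitation: a value tuned for one dataset may discard needed positions on another.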
Taxonomy
Methods: Softmax · Attention Is All You Need · Pruning
