AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models

Yifeng Gu; Zicong Jiang; Jianxiu Jin; Kailing Guo; Ziyang Zhang; Xiangmin Xu

arXiv:2506.03762·cs.CL·June 5, 2025

Open Access

TL;DR

AhaKV introduces an adaptive, holistic attention-driven KV cache eviction method that reduces bias in token importance scoring, improving memory efficiency and global context access in large language model inference.

Contribution

The paper proposes AhaKV, a novel adaptive method that refines token importance scores using holistic attention and value vectors, addressing bias issues in previous eviction strategies.

Findings

01. AhaKV effectively mitigates bias in token importance scoring.

02. AhaKV retains more crucial tokens for global context.

03. AhaKV achieves state-of-the-art results on benchmark tasks.

Abstract

Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only because of the large number of model parameters but also because the Key-Value (KV) cache consumes substantial memory during inference. Several works propose shrinking the KV cache by evicting unnecessary tokens, but these approaches rely on the accumulated attention score as the eviction score that quantifies each token's importance. We identify that the accumulated attention score is biased: its mathematical expectation decreases with token position. As a result, the retained tokens concentrate at the initial positions, limiting the model's access to global contextual information. To address this issue, we propose Adaptive holistic attention KV (AhaKV), which addresses the bias of the accumulated attention score by adaptively tuning…
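The positional bias described in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's derivation or the AhaKV method; it assumes, purely for illustration, that every query spreads its attention uniformly over the keys it can see under a causal mask. The function names `accumulated_uniform_attention` and `evict_by_score` are hypothetical. Even with no token being more informative than any other, earlier keys accumulate more score simply because more queries attend to them, so a top-k eviction rule keeps only the initial positions.

```python
def accumulated_uniform_attention(seq_len):
    """Accumulated attention score per key position.

    Assumption (illustrative only): query i (0-indexed) attends to keys
    0..i with equal weight 1 / (i + 1), as under a causal mask with
    uniform attention. The accumulated score of key j is the sum of the
    weights it receives from all queries i >= j.
    """
    scores = [0.0] * seq_len
    for i in range(seq_len):
        weight = 1.0 / (i + 1)
        for j in range(i + 1):
            scores[j] += weight
    return scores


def evict_by_score(scores, budget):
    """Hypothetical top-k eviction: keep the `budget` highest-scoring positions."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:budget])


scores = accumulated_uniform_attention(8)
kept = evict_by_score(scores, budget=3)

for position, score in enumerate(scores):
    print(f"key {position}: accumulated score {score:.3f}")
print("kept positions:", kept)
```

Under this assumption the score of key j equals the difference of harmonic numbers H_n - H_j, which strictly decreases with j, so the retained set is always a prefix of the sequence.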

Taxonomy

Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Softmax