AhaKV: Adaptive Holistic Attention-Driven KV Cache Eviction for Efficient Inference of Large Language Models
Yifeng Gu, Zicong Jiang, Jianxiu Jin, Kailing Guo, Ziyang Zhang, Xiangmin Xu

TL;DR
AhaKV introduces an adaptive, holistic attention-driven KV cache eviction method that reduces bias in token importance scoring, improving memory efficiency and global context access in large language model inference.
Contribution
The paper proposes AhaKV, a novel adaptive method that refines token importance scores using holistic attention and value vectors, addressing bias issues in previous eviction strategies.
Findings
AhaKV effectively mitigates bias in token importance scoring.
AhaKV retains more crucial tokens for global context.
AhaKV achieves state-of-the-art results on benchmark tasks.
Abstract
Large Language Models (LLMs) have significantly advanced the field of Artificial Intelligence. However, their deployment is resource-intensive, not only because of the large number of model parameters but also because the Key-Value (KV) cache consumes substantial memory during inference. Several works propose shrinking the KV cache by evicting unnecessary tokens, but these approaches rely on the accumulated attention score as the eviction score that quantifies each token's importance. We identify that the accumulated attention score is biased: in mathematical expectation, it decreases with token position. As a result, the retained tokens concentrate on the initial positions, limiting the model's access to global contextual information. To address this issue, we propose Adaptive holistic attention KV (AhaKV), which addresses the bias of the accumulated attention score by adaptively tuning…
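The positional bias described in the abstract can be illustrated with a small sketch. This is not the AhaKV method, only a toy model of the accumulated attention score it criticizes. It assumes causal attention whose logits are i.i.d. random, so each query attends uniformly in expectation. Under that assumption, the key at position j receives an expected total of sum over i >= j of 1/(i+1), which shrinks as j grows. All function names here are hypothetical and chosen for illustration.

```python
import math
import random

def causal_attention(scores):
    """Row-wise softmax where query i only sees keys 0..i (causal mask)."""
    n = len(scores)
    attn = [[0.0] * n for _ in range(n)]
    for i in range(n):
        row = scores[i][: i + 1]
        m = max(row)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        for j, e in enumerate(exps):
            attn[i][j] = e / z
    return attn

def accumulated_scores(attn):
    """Column sums: total attention each key position receives from all queries."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(j, n)) for j in range(n)]

def expected_accumulated_uniform(n):
    """Expected accumulated score if every causal row is uniform (assumption)."""
    return [sum(1.0 / (i + 1) for i in range(j, n)) for j in range(n)]

def mean_accumulated_random(n, trials=200, seed=0):
    """Monte Carlo average of accumulated scores over random Gaussian logits."""
    rng = random.Random(seed)
    total = [0.0] * n
    for _ in range(trials):
        scores = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
        acc = accumulated_scores(causal_attention(scores))
        total = [t + a for t, a in zip(total, acc)]
    return [t / trials for t in total]

def top_k_positions(values, k):
    """Positions a naive top-k eviction policy would keep."""
    ranked = sorted(range(len(values)), key=lambda j: -values[j])
    return sorted(ranked[:k])

n = 16
expected = expected_accumulated_uniform(n)
empirical = mean_accumulated_random(n)
kept = top_k_positions(expected, 4)
print("expected :", [round(v, 3) for v in expected])
print("empirical:", [round(v, 3) for v in empirical])
print("kept positions under top-4 eviction:", kept)
```

Even though no token is more informative than any other in this toy setup, ranking by accumulated score keeps only the earliest positions. That is the concentration on initial tokens that the abstract attributes to prior eviction strategies.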
Taxonomy
Topics: Big Data and Digital Economy · Advanced Neural Network Applications · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Softmax
