Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning
Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, Wen Xiao

TL;DR
This paper introduces HeadKV, a head-level KV cache compression method that allocates the KV cache budget across individual attention heads according to their estimated importance for retrieval and reasoning, significantly reducing memory while maintaining high performance.
Contribution
It proposes a new head-level KV cache compression technique with a contextual reasoning importance estimation, outperforming existing methods especially in low-resource scenarios.
Findings
Retains only 1.5% of KV cache while achieving 97% of full cache performance.
Outperforms strong baselines across diverse benchmarks and models.
Effective in low-resource settings with small KV sizes (64 & 128).
Abstract
Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context abilities tests…
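The abstract does not spell out how per-head importance turns into per-head cache sizes. The following is a minimal sketch of one way to do it, not the authors' implementation. It assumes non-negative importance scores that are normalized to sum to one over all heads and layers, a fixed base share that every head keeps, and a shared pool handed out in proportion to the normalized scores. The names `allocate_head_budgets`, `per_head_budget`, and `pool_fraction` are hypothetical and do not come from the paper.

```python
def allocate_head_budgets(scores, per_head_budget, pool_fraction=0.5):
    """Split a fixed total KV budget across heads (illustrative sketch only).

    scores: scores[layer][head], hypothetical non-negative importance scores.
    per_head_budget: KV entries each head would keep under a uniform split.
    pool_fraction: hypothetical knob; share of the uniform budget that is
        moved into a pool and redistributed by importance.
    """
    flat = [s for layer in scores for s in layer]
    if not flat or min(flat) < 0 or sum(flat) == 0:
        raise ValueError("scores must be non-negative with a positive sum")
    if not 0.0 <= pool_fraction <= 1.0:
        raise ValueError("pool_fraction must lie in [0, 1]")

    n_heads = len(flat)
    total = per_head_budget * n_heads            # stays constant overall
    score_sum = sum(flat)
    weights = [s / score_sum for s in flat]      # sums to 1 over all heads

    base = per_head_budget * (1.0 - pool_fraction)   # guaranteed per head
    pool = per_head_budget * pool_fraction * n_heads  # shared, score-driven
    ideal = [base + w * pool for w in weights]

    # Round down, then hand the leftover entries to the largest remainders
    # so the integer budgets still add up to the original total.
    budgets = [int(x) for x in ideal]
    leftover = total - sum(budgets)
    by_remainder = sorted(range(n_heads),
                          key=lambda i: ideal[i] - budgets[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        budgets[i] += 1

    # Restore the [layer][head] shape of the input.
    out, start = [], 0
    for layer in scores:
        out.append(budgets[start:start + len(layer)])
        start += len(layer)
    return out


# Two layers with two heads each, uniform budget of 64 entries per head.
print(allocate_head_budgets([[1, 3], [0, 4]], per_head_budget=64))
# -> [[48, 80], [32, 96]]
```

In this sketch a head with zero importance still keeps its base share (32 entries above), while the total across all heads stays at 4 × 64 = 256, so overall memory matches a uniform allocation.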
Peer Reviews
Decision: ICLR 2025 Poster
1. Reasoning head-level KV cache allocation and importance score estimation. 2. Performance can be consistently better than or comparable with the full KV setting.
1. We do not see much improvement in latency and memory compared with Ada-KV, on which I believe this work is based.
1. The KV cache budget allocation strategy maintains the total amount of KV cache constant and keeps inference time unchanged. 2. Using top-k attention weights to refine the score can enhance the retrieval-head evaluation. 3. The proposed retrieve-reasoning dataset may benefit future works.
1. How S_h is normalized is not mentioned in this paper. In Equation (4), it seems that S_h should sum to one across all heads and all layers. 2. How the retrieval-reasoning dataset is generated is not mentioned in the paper. 3. The left subfigure in Figure 6 says decoding time, but line 512 mentions that the decoding time includes prefilling time. This is quite confusing. In the figure, the prefill time of each method (when generating a length of 0) is squeezed into the same point on th
- The study presents a novel way to compress the KV cache based on different types of attention heads, even though there are concurrent works that pursue the same idea. - The proposed method surpasses the other baselines considered in the paper by a considerable margin in the extreme cases where the retained KV size is small.
1. Even though the paper claims to successfully identify heads that can do both retrieval and reasoning for **long-context** tasks, which is better than the retrieval-only heads initially proposed in Wu et al. (2024), the NIAH experiment setting in the paper is not long enough (longest prompt = 8k). I believe 8k is no longer considered long. Can the authors try longer NIAH tests, such as 64k or 128k, to show the effectiveness of the identified R2-heads? 2. I believe there is a work ca
Taxonomy
Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization
Methods: Softmax · Attention Is All You Need
