Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning

Yu Fu; Zefan Cai; Abedelkadir Asi; Wayne Xiong; Yue Dong; Wen Xiao

arXiv:2410.19258·cs.CL·October 24, 2025

TL;DR

This paper introduces HeadKV, a head-level KV cache compression method that allocates the cache budget across individual attention heads according to their estimated importance for retrieval and reasoning, substantially reducing memory while retaining most of the full-cache performance.

Contribution

It proposes a head-level KV cache compression technique with a contextual reasoning importance estimation, and it outperforms existing methods, especially when the retained KV cache is small.

Findings

01. Retains only 1.5% of the KV cache while achieving 97% of full-cache performance.

02. Outperforms strong baselines across diverse benchmarks and models.

03. Effective in low-resource settings with small KV sizes (64 and 128).
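To give a sense of scale for the small KV sizes mentioned in the findings, the following is a rough back-of-the-envelope sketch rather than a measurement from the paper. The model configuration (32 layers, 8 key-value heads, head dimension 128, 16-bit values), the 8,192-token context length, and the helper name `kv_cache_bytes` are all assumptions chosen for illustration.

```python
def kv_cache_bytes(tokens_per_head, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `tokens_per_head` entries per head."""
    # Factor 2: one key vector and one value vector per cached entry.
    return 2 * num_layers * num_kv_heads * tokens_per_head * head_dim * bytes_per_value


# Hypothetical configuration, not taken from the paper.
config = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}
context_length = 8192     # assumed prompt length in tokens
retained_per_head = 128   # one of the small KV sizes listed in the findings

full_bytes = kv_cache_bytes(context_length, **config)
small_bytes = kv_cache_bytes(retained_per_head, **config)
retained_fraction = small_bytes / full_bytes

print(f"full cache:     {full_bytes / 2**20:.0f} MiB")
print(f"retained cache: {small_bytes / 2**20:.0f} MiB")
print(f"fraction kept:  {retained_fraction:.2%}")
```

Under these assumptions the full cache occupies 1 GiB and the retained cache 16 MiB, or about 1.6% of the full size. The ratio depends only on the retained size and the context length, so a longer prompt makes the retained fraction smaller still.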

Abstract

Key-Value (KV) caching is a common technique to enhance the computational efficiency of Large Language Models (LLMs), but its memory overhead grows rapidly with input length. Prior work has shown that not all tokens are equally important for text generation, proposing layer-level KV cache compression to selectively retain key information. Recognizing the distinct roles of attention heads in generation, we propose HeadKV, a head-level KV cache compression method, and HeadKV-R2, which leverages a novel contextual reasoning ability estimation for compression. Our approach operates at the level of individual heads, estimating their importance for contextual QA tasks that require both retrieval and reasoning capabilities. Extensive experiments across diverse benchmarks (LongBench, LooGLE), model architectures (e.g., Llama-3-8B-Instruct, Mistral-7B-Instruct), and long-context abilities tests…
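The idea of compressing at the level of individual heads can be sketched as follows. This is a minimal illustration and not the authors' implementation: the base-plus-pool allocation scheme, the function names `allocate_head_budgets` and `keep_top_entries`, the `pooled_fraction` parameter, and the example scores are all assumptions. The sketch only shows how per-head importance scores could translate into per-head cache budgets while the total budget stays bounded.

```python
def allocate_head_budgets(scores, base_budget, pooled_fraction):
    """Distribute a fixed total KV budget across heads by importance.

    scores: nested list [layer][head] of non-negative importance scores (assumed given).
    base_budget: number of cache entries each head would get under uniform allocation.
    pooled_fraction: share of each head's base budget moved into a shared pool.
    """
    flat = [s for layer in scores for s in layer]
    total_score = sum(flat)
    if total_score <= 0:
        raise ValueError("importance scores must have a positive sum")

    pool_per_head = int(base_budget * pooled_fraction)
    fixed = base_budget - pool_per_head   # entries every head keeps regardless of score
    pool = pool_per_head * len(flat)      # entries redistributed by importance

    # Normalizing by total_score makes the shares sum to one across all heads and layers.
    # Flooring means the total may fall slightly below base_budget * number of heads.
    return [[fixed + int(pool * s / total_score) for s in layer] for layer in scores]


def keep_top_entries(token_scores, budget):
    """Return the positions of the `budget` highest-scoring cached tokens, in order."""
    ranked = sorted(range(len(token_scores)), key=lambda i: token_scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical scores for a toy model with 2 layers and 2 heads per layer.
budgets = allocate_head_budgets([[1, 4], [3, 2]], base_budget=128, pooled_fraction=0.5)
kept = keep_top_entries([0.1, 0.9, 0.3, 0.7], budget=2)
print(budgets)
print(kept)
```

In this toy setup, heads with higher scores receive more cache entries than the uniform 128, heads with lower scores receive fewer, and the overall total never exceeds what uniform allocation would use.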

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. Reasoning head-level KV cache allocation and importance score estimation.
2. Performance is consistently better than or comparable to the full KV setting.

Weaknesses

1. We do not see much improvement in latency or memory compared with Ada-KV, as I believe this work is based on Ada-KV.

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The KV cache budget allocation strategy keeps the total amount of KV cache constant and leaves inference time unchanged.
2. Using top-k attention weights to refine the score can enhance the retrieval-head evaluation.
3. The proposed retrieval-reasoning dataset may benefit future work.

Weaknesses

1. How S_h is normalized is not mentioned in the paper. In Equation (4), it seems that S_h should sum to one across all heads and all layers.
2. How the retrieval-reasoning dataset is generated is not mentioned in the paper.
3. The left subfigure in Figure 6 says decoding time, but line 512 mentions that the decoding time includes prefilling time. This is quite confusing. In the figure, the prefill time of each method (when generating a length of 0) is squeezed into the same point on th

Reviewer 03 · Rating 6 · Confidence 4

Strengths

- The study presents a novel way to compress the KV cache based on different types of attention heads, even though concurrent works pursue the same idea.
- The proposed method surpasses the other baselines considered in the paper by a considerable margin in the extreme cases where the retained KV size is small.

Weaknesses

1. The paper claims to successfully identify heads that can do both retrieval and reasoning for **long-context** tasks, which is better than the retrieval-only heads initially proposed in Wu et al. (2024). However, the NIAH experiment setting in the paper is not long enough (longest prompt = 8k); I believe 8k is no longer considered long. Can the authors try a longer NIAH test, such as 64k or 128k, to show the effectiveness of the identified R2-heads?
2. I believe there is a work ca

Code & Models

Repositories

fyyfu/headkv
PyTorch · Official

Taxonomy

Topics: Algorithms and Data Compression · Parallel Computing and Optimization Techniques · Network Packet Processing and Optimization

Methods: Softmax · Attention Is All You Need