CompressKV: Semantic Retrieval Heads Know What Tokens are Not Important Before Generation
Xiaolin Lin, Jingcun Wang, Olga Kondrateva, Yiyu Shi, Bing Li, Grace Li Zhang

TL;DR
CompressKV introduces a method to identify and retain important tokens in large language models by analyzing attention heads, improving memory efficiency without sacrificing performance.
Contribution
The paper proposes a head-specific token importance detection method and a layer-adaptive cache allocation strategy for better KV cache compression in GQA-based LLMs.
Findings
Outperforms state-of-the-art cache compression methods.
Effectively retains important tokens, improving model performance.
Reduces memory usage across various benchmarks.
Abstract
Recent advances in large language models (LLMs) have significantly boosted long-context processing. However, the increasing key-value (KV) cache size poses critical challenges to memory and execution efficiency. Most KV cache compression methods rely on heuristic token eviction using all attention heads in Grouped Query Attention (GQA)-based LLMs. This approach ignores the different functionalities of attention heads, leading to the eviction of critical tokens and thus degrading the performance of LLMs. To address this issue, instead of using all the attention heads in GQA-based LLMs to determine important tokens as in previous work, we first identify the attention heads in each layer that are not only capable of retrieving the initial and final tokens of a prompt, but also capable of retrieving important tokens within the text and attending to their surrounding semantic…
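The idea of scoring tokens with a chosen subset of heads rather than all heads can be pictured with a small sketch. This is not the authors' implementation: the function name `select_tokens`, the parameters `n_sink` and `n_recent`, and the plain averaging of per-head scores are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def select_tokens(
    head_scores: Dict[int, Sequence[float]],  # head index -> per-token attention score
    selected_heads: Sequence[int],            # heads trusted to judge importance (assumed given)
    budget: int,                              # target number of tokens to keep
    n_sink: int = 1,                          # hypothetical: always keep the first n_sink tokens
    n_recent: int = 2,                        # hypothetical: always keep the last n_recent tokens
) -> List[int]:
    n_tokens = len(head_scores[selected_heads[0]])
    # Average the per-token scores over the selected heads only.
    avg = [
        sum(head_scores[h][t] for h in selected_heads) / len(selected_heads)
        for t in range(n_tokens)
    ]
    # Initial and final tokens are kept unconditionally.
    keep = set(range(min(n_sink, n_tokens)))
    keep |= set(range(max(0, n_tokens - n_recent), n_tokens))
    # Fill the remaining budget with the highest-scoring middle tokens.
    rest = sorted((t for t in range(n_tokens) if t not in keep), key=lambda t: -avg[t])
    for t in rest:
        if len(keep) >= budget:
            break
        keep.add(t)
    return sorted(keep)


# Toy scores: head 0 puts most of its weight on token 1, head 1 on token 2.
scores = {
    0: [0.3, 0.5, 0.0, 0.0, 0.1, 0.1],
    1: [0.1, 0.0, 0.4, 0.1, 0.2, 0.2],
}
kept_selected = select_tokens(scores, selected_heads=[1], budget=4)
kept_all = select_tokens(scores, selected_heads=[0, 1], budget=4)
```

In this toy case, averaging over both heads keeps token 1 and drops token 2, while scoring with head 1 alone keeps token 2. How the heads are actually chosen is the subject of the paper and is not modeled here.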
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
- The paper includes comprehensive experimental comparisons with multiple baseline methods, demonstrating improvements on LongBench and Needle-in-a-Haystack.
- The approach delivers strong end-to-end inference efficiency, highlighting its practical applicability for long-context scenarios.
Several claims require corrections or proper references:
- The statement that prior work treats all attention heads equally (L154) is inaccurate. For example, DuoAttention (ICLR 2025) explicitly differentiates retrieval and streaming heads, contradicting this generalization.
- Claims in L169–170 regarding reliance on attention statistics (e.g., entropy, variance) lack citations or supporting references.
- (minor) L172, which states that methods adopt a fixed allocation strategy based on attention
The paper identifies semantic retrieval heads to enable KV cache compression at a high ratio: 1) the Semantic Retrieval Score is defined over the entire answer span inserted into a long context; 2) the scores are then averaged and ranked to determine which tokens are important and which are evicted, and the important token indices are shared across heads; 3) the compression budget is adaptively allocated for different heads using the error-aware method. Therefore the KV ca
The concept of semantic retrieval heads builds on retrieval heads, which store and retrieve important information. Token eviction may lead to significant information loss, especially in long-CoT or ReAct agent scenarios. In addition, the compression is based on statistics of the Semantic Retrieval Score, which relies on the attention weights of tokens within the answer span, so the online overhead may be large, hindering practical application.
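To make the contrast between span-level aggregation and a single-token top-1 criterion concrete, here is a minimal sketch. It is not the paper's scoring formula (Eq. 5 is not reproduced on this page); `span_score` and `top1_score` are hypothetical names, and "mean attention mass inside the answer span" is an assumed stand-in for the Semantic Retrieval Score.

```python
from typing import Sequence, Tuple


def span_score(attn: Sequence[Sequence[float]], span: Tuple[int, int]) -> float:
    """Mean attention mass inside the answer span, averaged over steps.

    attn[i][t] is one head's attention weight at step i on context token t;
    span is a half-open (start, end) range of context positions.
    """
    start, end = span
    return sum(sum(row[start:end]) for row in attn) / len(attn)


def top1_score(attn: Sequence[Sequence[float]], span: Tuple[int, int]) -> float:
    """Fraction of steps whose single most-attended token lies in the span."""
    start, end = span
    hits = 0
    for row in attn:
        peak = max(range(len(row)), key=row.__getitem__)
        if start <= peak < end:
            hits += 1
    return hits / len(attn)


# One head, two steps, five context tokens; the answer span covers tokens 1..3.
# The peak sits on token 0 at both steps, yet most attention mass is in the span.
attn = [
    [0.4, 0.2, 0.2, 0.2, 0.0],
    [0.4, 0.3, 0.2, 0.1, 0.0],
]
answer_span = (1, 4)
s_span = span_score(attn, answer_span)
s_top1 = top1_score(attn, answer_span)
```

Here the top-1 criterion gives this head a score of zero, while the span-aggregated score credits it with roughly 0.6 of its attention mass. A head like this, with attention spread over the answer rather than peaked on one token, is the kind a peak-driven criterion could overlook.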
1) Robust Semantic Head Identification. The proposed identification of Semantic Retrieval Heads through answer-span attention aggregation effectively mitigates the limitations of conventional top-k single-token attention methods, which may overlook semantically distributed relevance.
2) Efficient Offline Layer-wise Importance Estimation. The offline computation of per-layer importance avoids the heavy online overhead faced by methods like CAKE or PyramidKV, enabling more efficient runtime compre
1) Potential Circular Reasoning. If SRHs are identified using ground-truth answers and later evaluated on the same benchmark, the method may inadvertently benefit from prior exposure to the correct spans, leading to overly optimistic results. Although the stages described in Strength 1 and Strength 2 appear effective in identifying semantic and streaming heads, the process relies on test-set analysis of head and layer behavior based on known answers.
2) Limited Generalization Beyond Retrieval-Oriented Tasks. The
- Introduces Semantic Retrieval Heads (SRHs) via span-aggregation (vs. peak-driven top-k) to fix mid-prompt token eviction from Streaming Head dominance in GQA models. Also proposes offline Frobenius-norm error analysis for layer-adaptive budgeting, avoiding online attention stats dependency.
- Outperforms 6 baselines on multiple GQA models (including 14B/32B scales) across LongBench (16 subtasks) and Needle-in-a-Haystack (2K–128K contexts). Key results: +2.5 points vs. HeadKV (256-token budget
**Methodological Limitations**: The identification of Semantic Retrieval Heads (SRHs) relies on Needle-in-a-Haystack-style prompts, which may bias toward retrieval tasks (e.g., QA) and limit generalization to non-QA scenarios like creative generation or multi-turn dialogues. For instance, the SRH scoring formula (Eq. 5) assumes clear answer spans but may degrade to top-1 criteria for short or ambiguous spans, reducing accuracy. Additionally, the offline Frobenius-norm error analysis uses LongBen
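The offline Frobenius-norm error analysis and layer-adaptive budgeting mentioned in the reviews can be pictured roughly as follows. This is only a sketch under stated assumptions: the page does not give the paper's allocation rule, so splitting the budget in proportion to each layer's error is an assumption, and `frobenius_error` and `allocate_budgets` are hypothetical names.

```python
import math
from typing import List, Sequence


def frobenius_error(full: Sequence[Sequence[float]],
                    compressed: Sequence[Sequence[float]]) -> float:
    """Frobenius norm of the difference between two same-shaped matrices."""
    return math.sqrt(sum(
        (a - b) ** 2
        for row_f, row_c in zip(full, compressed)
        for a, b in zip(row_f, row_c)
    ))


def allocate_budgets(layer_errors: Sequence[float], total_budget: int) -> List[int]:
    """Split total_budget across layers in proportion to their errors (assumed rule)."""
    total = sum(layer_errors)
    budgets = [int(total_budget * e / total) for e in layer_errors]
    # Rounding down can leave a few tokens unassigned; hand them to the
    # layers with the largest errors, one token each.
    leftover = total_budget - sum(budgets)
    by_error = sorted(range(len(layer_errors)), key=lambda i: -layer_errors[i])
    for i in by_error[:leftover]:
        budgets[i] += 1
    return budgets


# Error of one layer's output when half of it is zeroed out (toy numbers).
err = frobenius_error([[1.0, 2.0], [3.0, 4.0]], [[1.0, 2.0], [0.0, 0.0]])
# Errors assumed to be measured offline, once per layer, on calibration data.
budgets = allocate_budgets([5.0, 3.0, 2.0], total_budget=100)
```

Because the per-layer errors are computed ahead of time, the allocation at inference time reduces to a table lookup, which is the property the reviewers contrast with methods that gather attention statistics online.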
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Big Data and Digital Economy
