ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
Xianglong Yan, Zhiteng Li, Tianao Zhang, Haotong Qin, Linghe Kong, Yulun Zhang, Xiaokang Yang

TL;DR
ReCalKV introduces a novel post-training low-rank KV cache compression method for large language models, employing head reordering and offline calibration to maintain performance while significantly reducing memory usage.
Contribution
The paper presents ReCalKV, a post-training approach that optimizes the Key and Value caches separately: head-wise similarity-aware reordering for Keys and offline calibration for Values, which together improve compression effectiveness.
Findings
ReCalKV outperforms existing methods in high compression scenarios.
It achieves minimal performance loss at high compression ratios.
The approach is validated through extensive experiments.
Abstract
Large language models (LLMs) have demonstrated remarkable performance, but their long-context reasoning remains constrained by the excessive memory required for the Key-Value (KV) cache. This makes KV cache compression a critical step toward efficient long-context inference. Recent methods have explored low-rank techniques to reduce the hidden size of the KV cache. However, they neglect the distinct roles and varying importance of Keys and Values, leading to significant performance drops under high compression. To address this, we propose ReCalKV, a post-training low-rank KV cache compression approach with tailored strategies for Keys and Values. For Keys, we propose Head-wise Similarity aware Reordering (HSR), which clusters structurally similar heads into groups, enabling more accurate low-rank approximation via grouped SVD. For Values, we propose Offline Value Calibration (OVC),…
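The Key-side idea in the abstract (group similar heads, then run one truncated SVD per group) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes linear CKA computed on per-head key outputs over a small calibration batch, greedy pairing with a group size of 2, and synthetic weights in which heads 0 and 2 (and heads 1 and 3) share a column space. The helpers `linear_cka`, `greedy_pairs`, `grouped_svd`, and `recon_error` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def linear_cka(a, b):
    """Linear CKA between two feature matrices with the same number of rows."""
    a = a - a.mean(axis=0, keepdims=True)
    b = b - b.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(a.T @ b, "fro") ** 2
    return cross / (np.linalg.norm(a.T @ a, "fro") * np.linalg.norm(b.T @ b, "fro"))

def greedy_pairs(sim):
    """Greedily pair heads, highest similarity first; each head is used once."""
    n = sim.shape[0]
    ranked = sorted(((sim[i, j], i, j) for i in range(n) for j in range(i + 1, n)),
                    reverse=True)
    used, pairs = set(), []
    for _, i, j in ranked:
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs

def grouped_svd(w_heads, groups, rank):
    """Truncated SVD of each group's concatenated weights; one (L, R) pair per group."""
    factors = []
    for g in groups:
        w = np.concatenate([w_heads[h] for h in g], axis=1)  # (d_model, len(g) * d_head)
        u, s, vt = np.linalg.svd(w, full_matrices=False)
        factors.append((u[:, :rank] * s[:rank], vt[:rank]))
    return factors

def recon_error(w_heads, groups, rank):
    """Total squared Frobenius error of the grouped low-rank approximation."""
    err = 0.0
    for g, (l, r) in zip(groups, grouped_svd(w_heads, groups, rank)):
        w = np.concatenate([w_heads[h] for h in g], axis=1)
        err += np.linalg.norm(w - l @ r) ** 2
    return err

# Synthetic setup (assumption): heads 0/2 and 1/3 are rotated copies of each other.
rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 16, 4, 64
x = rng.standard_normal((n_tokens, d_model))            # calibration activations
base_a = rng.standard_normal((d_model, d_head))
base_b = rng.standard_normal((d_model, d_head))
rot, _ = np.linalg.qr(rng.standard_normal((d_head, d_head)))
w_heads = [base_a, base_b, base_a @ rot, base_b @ rot]

keys = [x @ w for w in w_heads]                          # per-head key outputs
sim = np.array([[linear_cka(ka, kb) for kb in keys] for ka in keys])
pairs = greedy_pairs(sim)                                # similarity-aware grouping
adjacent = [(0, 1), (2, 3)]                              # grouping without reordering

err_reordered = recon_error(w_heads, pairs, rank=d_head)
err_adjacent = recon_error(w_heads, adjacent, rank=d_head)
print(pairs, err_reordered, err_adjacent)
```

In this toy setting the reordered groups are exactly low-rank, so the same rank budget gives a far smaller reconstruction error than grouping adjacent heads. Real model weights will show a smaller gap.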
Peer Reviews
Decision · Submitted to ICLR 2026
- Strong improvements over PaLU on the evaluated models.
- Comprehensive ablation studies demonstrating the effectiveness of each contribution for both key and value projections.
**[W1]** The evaluated models are outdated. I suggest moving the results on the Llama-3.1 model to the main body. This is important because many modern LLM architectures employ GQA, while only the Mistral model from the main results section does so. Validation on multiple models with GQA would strengthen the paper.

**[W2]** Comments on writing:
- *L55–60:* This is not something revealed by your analysis.
- *L60–63:* I cannot find a section describing your analysis of Fisher information.
- The f
1. The paper identifies and analyzes the asymmetric roles of Keys and Values, particularly emphasizing that individual attention heads differ in information content. Using CKA-based head reordering before SVD to minimize approximation error is a well-motivated and conceptually sound idea.
2. Across multiple model families and compression ratios, ReCalKV demonstrates competitive or superior results compared with the main low-rank baseline (Palu). The method maintains high accuracy even under agg
1. The paper mainly compares with low-rank SVD-based approaches such as Palu, but lacks comparisons with other classes of KV cache compression methods (e.g., KIVI, KVQuant, or token eviction approaches). As a result, the reader cannot fully assess how ReCalKV performs in a broader landscape of KV compression techniques, especially when low-rank compression is not necessarily the only or best strategy.
2. Experiments focus on older LLaMA/Mistral models, with limited evaluation on recent archite
* K-side uses CKA-guided head reordering + grouped SVD to share low-rank factors among similar heads (greedy pairing; fixed group size), and V-side uses closed-form offline calibration to minimize projection error on a small calibration set.
* Matrix fusion folds $R_v$ into $W_o$, eliminating online reconstruction and avoiding extra inference ops; the end-to-end procedure is fully post-training/offline (Algorithm 1).
* Strong ablations isolate HSR and OVC and show they are complementary at
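The Value-side steps mentioned in this review (closed-form calibration on a small calibration set, then folding $R_v$ into $W_o$) can be sketched as follows. This is an illustration under stated assumptions, not the paper's Algorithm 1. It assumes a single head, keeps the SVD factor `l_v` fixed, and re-fits only `r_v` by ordinary least squares on synthetic calibration activations. The paper's exact calibration objective is not given in this summary, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head, rank = 128, 32, 16, 4

# Hypothetical calibration activations with uneven feature scales.
x = rng.standard_normal((n_tokens, d_model)) * np.linspace(0.1, 3.0, d_model)
w_v = rng.standard_normal((d_model, d_head))   # value projection (one head)
w_o = rng.standard_normal((d_head, d_model))   # output projection slice for that head

# Step 1: data-free truncated SVD, W_v ~= L_v @ R_v.
u, s, vt = np.linalg.svd(w_v, full_matrices=False)
l_v = u[:, :rank] * s[:rank]                   # (d_model, rank): produces the cached latent
r_v = vt[:rank]                                # (rank, d_head): reconstruction factor

# Step 2 (assumed objective): keep L_v and re-fit R_v in closed form so that
# (X @ L_v) @ R matches X @ W_v on the calibration data in the least-squares sense.
latent = x @ l_v                               # what would be stored in the KV cache
r_cal, *_ = np.linalg.lstsq(latent, x @ w_v, rcond=None)

err_svd = np.linalg.norm(x @ w_v - latent @ r_v)
err_cal = np.linalg.norm(x @ w_v - latent @ r_cal)

# Step 3: fold the reconstruction factor into the output projection offline.
w_o_fused = r_cal @ w_o                        # (rank, d_model)

# Attention weights act on the token axis, so they commute with the fusion.
attn = rng.random((n_tokens, n_tokens))
attn /= attn.sum(axis=1, keepdims=True)
out_reconstruct = attn @ (latent @ r_cal) @ w_o   # reconstruct values, then project
out_fused = attn @ latent @ w_o_fused             # no online reconstruction

print(err_svd, err_cal, np.allclose(out_reconstruct, out_fused))
```

Because the plain SVD factor `r_v` is itself a feasible solution of the least-squares problem, the calibrated error on the calibration set can never exceed the uncalibrated one. The fused path produces the same output while multiplying only by a `rank`-row matrix at inference time.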
- Code not provided, therefore it's not reproducible as is.
- While the method is effective, baselines are limited primarily to Palu (G-LRD), lacking comparisons with recent variants like CommonKV or FDC, which could better substantiate SOTA claims (section 4).
- Experiments do not quantify runtime overhead from Key reconstruction post-HSR (Figure 3), despite claims of low cost; real-world latency measurements on diverse hardware would strengthen efficiency arguments (Figure 4).
- Equations (7)
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Machine Learning in Healthcare
Methods: Softmax · Attention Is All You Need
