TL;DR
CommonKV is a training-free method that uses cross-layer parameter sharing via SVD to compress KV caches in LLMs, significantly reducing memory usage with minimal performance impact.
Contribution
It introduces a novel, training-free approach to KV cache compression that uses SVD to share parameters across adjacent layers, improving over existing low-rank and cross-layer methods.
Findings
Outperforms existing low-rank and cross-layer methods at various compression ratios.
Achieves up to 98% compression with minimal performance loss.
Compatible with other quantization and eviction techniques.
Abstract
Large Language Models (LLMs) confront significant memory challenges due to the escalating KV cache with increasing sequence length. As a crucial technique, existing cross-layer KV cache sharing methods either necessitate modified model architectures with subsequent pre-training or incur significant performance degradation at high compression rates. To mitigate these challenges, we propose CommonKV, a training-free method for cross-layer KV cache compression through adjacent parameters sharing. Inspired by the high similarity observed in cross-layer hidden states, we utilize Singular Value Decomposition (SVD) to achieve weight sharing across adjacent parameters, resulting in a more easily mergeable latent KV cache. Furthermore, we also introduce an adaptive budget allocation strategy. It dynamically assigns compression budgets based on cosine similarity, ensuring that dissimilar caches…
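The abstract describes the SVD-based sharing step only at a high level. As a rough sketch of one way SVD can tie adjacent layers' projection weights to a common factor, the NumPy example below concatenates two layers' key-projection matrices, takes a truncated SVD, and keeps one shared down-projection (`shared_a`) plus a small per-layer up-projection (`per_layer_b`). The function name `shared_low_rank`, the concatenation axis, and the toy dimensions are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def shared_low_rank(weights, rank):
    """Factor each W_i (d_model x d_out) as W_i ~= A @ B_i with a single shared A.

    Hypothetical helper: the concatenate-then-SVD layout is an assumption,
    not the paper's confirmed procedure.
    """
    d_out = weights[0].shape[1]
    # Stack the layers' weights side by side: (d_model, n_layers * d_out).
    stacked = np.concatenate(weights, axis=1)
    u, s, vt = np.linalg.svd(stacked, full_matrices=False)
    # Shared factor: top-`rank` left singular vectors scaled by singular values.
    shared_a = u[:, :rank] * s[:rank]
    # Per-layer factors: the matching column block of the truncated V^T.
    per_layer_b = [
        vt[:rank, i * d_out:(i + 1) * d_out] for i in range(len(weights))
    ]
    return shared_a, per_layer_b


# Toy example: two adjacent layers whose weights are nearly identical.
rng = np.random.default_rng(0)
d_model, d_kv, rank = 64, 32, 16
base = rng.standard_normal((d_model, d_kv))
w1 = base + 0.01 * rng.standard_normal((d_model, d_kv))
w2 = base + 0.01 * rng.standard_normal((d_model, d_kv))

shared_a, (b1, b2) = shared_low_rank([w1, w2], rank)

# Each layer would cache the rank-sized latent x @ shared_a instead of the
# full key, and rebuild an approximate key with its own B_i when needed.
x = rng.standard_normal((5, d_model))   # 5 tokens of hidden states
latent = x @ shared_a                   # shape (5, rank)
approx_k1 = latent @ b1                 # shape (5, d_kv)
rel_err = np.linalg.norm(approx_k1 - x @ w1) / np.linalg.norm(x @ w1)
```

Under this factorization every layer in the group projects through the same matrix, so the latents of adjacent layers live in one common space. That is one plausible reading of why the abstract calls the latent cache more easily mergeable.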
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
Thank you for submitting this paper to ICLR! KV cache compression is one of the most important and popular topics in efficient LLM inference. I appreciate the authors' efforts on the analysis of limitations in existing methods (not training free, and performance degradation under aggressive compression). Cross-layer parameter sharing is a very reasonable idea (somehow explored before). The evaluation baselines are strong and timely methods in the field as well. In particular, section 6 gives a v
1. As compared to other KV cache compression papers in the community, the evaluation is flawed in many aspects, including but not limited to context window length of 8K, model sizes (7/8B), rationale of hyperparameter choices, system performance metrics, etc. Please refer to "questions" for a comprehensive list.
2. There are quite a few unfilled question marks (??) and TODOs in the current draft.
3. Figures 1, 3, 4 are hard to read --- Please consider enlarging the fonts.
The method's core strength is its novel idea of creating a more consistent latent cache via parameter sharing, which directly addresses the root cause of poor performance in direct KV cache merging. Its training-free nature makes it highly practical and easy to apply to existing models. CommonKV demonstrates superior empirical performance over baselines at high compression ratios with minimal inference latency overhead. Furthermore, its ability to be combined with quantization and eviction metho
The paper would be stronger with a more detailed sensitivity analysis of the SVD rank hyperparameter. Additionally, the handling of GQA models feels like a workaround, and a deeper analysis of the architectural interaction would be beneficial.
* Neat insight into using parameter sharing across layers to compress based on similarity between adjacent layers of the KV Cache.
* Allows dynamic adaptation across different groups of KV layers for better performance.
* Training-free method of KV merging leads to lower offline compute overhead
* Results are very impressive compared to all the baselines.
* *To reduce computational overhead, we only use the cosine similarity between the first and last layers within each group as its score* It needs further justification that it is sufficient to use just the first and last layer within each group without sacrificing quality.
* It is not clear why the latent cache is more easily mergeable. I see the claim that it has more consistent hidden states, but this claim needs further explanation. This is especially important as there is a two-way overhead
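The group score quoted in this review can be sketched in a few lines. The example below computes the mean token-wise cosine similarity between the states of a group's first and last layers, then orders groups from least to most similar. The names `group_score` and `rank_groups`, the averaging over tokens, and the small epsilon are assumptions for illustration. How the scores translate into compression budgets is not given in the text here, so the sketch stops at the ranking.

```python
import numpy as np


def group_score(first_layer_states, last_layer_states):
    """Mean per-token cosine similarity between two (tokens x dim) arrays.

    Hypothetical helper: averaging over tokens is an assumption.
    """
    a = np.asarray(first_layer_states, dtype=float)
    b = np.asarray(last_layer_states, dtype=float)
    dots = np.sum(a * b, axis=-1)
    # Small epsilon guards against division by zero for all-zero rows.
    norms = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-12
    return float(np.mean(dots / norms))


def rank_groups(scores):
    """Return group indices ordered from least to most similar."""
    return [int(i) for i in np.argsort(scores)]


# Toy example: three groups, each scored from its first and last layer only.
rng = np.random.default_rng(0)
first = [rng.standard_normal((8, 16)) for _ in range(3)]
noise_levels = [0.1, 1.0, 0.5]
last = [
    f + level * rng.standard_normal((8, 16))
    for f, level in zip(first, noise_levels)
]
scores = [group_score(f, l) for f, l in zip(first, last)]
order = rank_groups(scores)
```

Scoring only the two boundary layers costs one similarity computation per group regardless of group size, which is the overhead saving the quoted sentence refers to. Whether that is sufficient for quality is the open question the reviewer raises.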
- an interesting observation regarding the KV-generator being compressible is utilized to effectively reduce inference overhead.
- no pretraining/fine-tuning makes it quite practical
- strong performance with good latency analysis. also orthogonal to other optimizations (eviction/quantization)
- discusses SVD cost of xKV well, does not have the same limitations, addresses a good set of baselines.
- Several figure references are broken, TODO -- should be removed.
