KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing
Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen

TL;DR
KVSharer is a novel method that reduces memory and computation in large language model inference by sharing dissimilar key-value caches across layers, achieving significant efficiency gains without major performance loss.
Contribution
It introduces a layer-wise KV cache sharing technique that surprisingly benefits from dissimilar cache sharing, and demonstrates compatibility with existing intra-layer compression methods.
Findings
Reduces KV cache memory by 30% during inference.
Achieves at least 1.3x faster generation.
Compatible with existing intra-layer compression methods.
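To make the memory figure concrete, the arithmetic behind a layer-sharing saving can be sketched as below. This is a back-of-the-envelope illustration, not the paper's measurement: it assumes a standard cache that stores one key and one value tensor per layer, and the model dimensions (40 layers, 32 KV heads, head dimension 128, 4096 tokens, 16-bit elements) and function names are hypothetical values chosen only so that 12 shared layers correspond to a 30% reduction.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Size of a standard KV cache in bytes.

    The leading factor of 2 counts one key tensor and one value
    tensor per layer.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def shared_kv_cache_bytes(num_layers, num_shared_layers, **dims):
    """Cache size when `num_shared_layers` layers reuse another layer's cache.

    Assumption: a layer that borrows a cache stores nothing of its own.
    """
    return kv_cache_bytes(num_layers - num_shared_layers, **dims)


# Hypothetical model dimensions (not taken from the paper).
dims = dict(num_kv_heads=32, head_dim=128, seq_len=4096,
            batch_size=1, bytes_per_elem=2)

full = kv_cache_bytes(40, **dims)
shared = shared_kv_cache_bytes(40, 12, **dims)
saving = 1 - shared / full

print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"shared cache: {shared / 2**30:.2f} GiB")
print(f"saving:       {saving:.0%}")
```

Under these assumptions the saving depends only on the fraction of layers that borrow a cache, so it is independent of sequence length and batch size.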
Abstract
The development of large language models (LLMs) has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The key and value storage of the attention map in the KV (key-value) cache accounts for more than 80% of this memory consumption. Nowadays, most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer but few works consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves the model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption…
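The sharing strategy described in the abstract and in the reviews on this page (rank layer pairs from most to least dissimilar, then keep a pair only if the compressed model's output stays close to the original's in cosine similarity) might be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the Euclidean distance, the threshold value, the rule that a layer is either a donor or a borrower, and the names `search_sharing`, `run_model`, and `toy_model` are all assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def search_sharing(layer_kv, run_model, target_shared, threshold):
    """Greedy search for a layer-sharing plan.

    layer_kv:  one flattened, averaged KV vector per layer.
    run_model: hypothetical callable mapping a plan {borrower: donor}
               to an output vector of the (compressed) model.
    Returns a dict {borrowing_layer: donor_layer}.
    """
    reference = run_model({})  # output of the uncompressed model
    # Assumption: dissimilarity is Euclidean distance; most dissimilar first.
    pairs = sorted(
        combinations(range(len(layer_kv)), 2),
        key=lambda p: euclidean(layer_kv[p[0]], layer_kv[p[1]]),
        reverse=True,
    )
    sharing = {}
    for donor, borrower in pairs:
        if len(sharing) >= target_shared:
            break
        # Simplifying assumption: no chains, so a layer is either a donor
        # or a borrower, and each borrower has exactly one donor.
        if borrower in sharing or donor in sharing or borrower in sharing.values():
            continue
        candidate = {**sharing, borrower: donor}
        # Keep the pair only if the output stays close to the original.
        if cosine(run_model(candidate), reference) >= threshold:
            sharing = candidate
    return sharing


# Toy stand-in for a model: the "output" is the sum of the KV vectors
# each layer ends up using.
layer_kv = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]


def toy_model(sharing):
    out = [0.0, 0.0]
    for layer in range(len(layer_kv)):
        kv = layer_kv[sharing.get(layer, layer)]
        out[0] += kv[0]
        out[1] += kv[1]
    return out


plan = search_sharing(layer_kv, toy_model, target_shared=1, threshold=0.9)
print(plan)
```

In this toy setup the most dissimilar pair fails the output check at a threshold of 0.9 and the next candidate is accepted instead, which shows how the ranking and the similarity check interact. It says nothing about which of the two matters more in the real method.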
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
This paper and the technique introduced have the following strengths: 1. The paper is easy to follow, with good figures and illustrations. 2. The experiment sections demonstrate that KVSharer is orthogonal to, and can be combined with, other intra-layer KV compression techniques like H2O and PyramidInfer to achieve higher memory savings and more significant speedup. 3. The paper brings up a new angle
I have several concerns about the paper: 1. Even though layer pairs are ranked from high dissimilarity to low dissimilarity, whether to use a pair still depends on the cosine similarity between the KV-cache compression model and the original model. There is a possibility that the cosine similarity check, rather than the dissimilarity ranking, plays the major role. 2. A major claim in the paper is that dissimilarity metrics are better than similarity metrics when it comes to inter-layer KV cache
- S1. This paper explores an important problem of improving the efficiency in utilizing KV cache in LLM generative inference. - S2. The related work and research context are well summarized.
- W1. Heuristic-based on aggregated information. As enumerated in Section 3.1.2, the proposed method uses the averaged value of the KV-cache to consider the similarity between different layers -- it is a little confusing why such highly integrated information could guide the sharing policy, considering lots of recent work has been exploring the KV-cache utilization at token, layer, and head level jointly. My concern is whether such a highly aggregated metric is informative or not. - W2. My main
1. This paper addresses a good research topic: efficient LLM inference. 2. The paper is well-organized. 3. The proposed method is clearly presented.
1. **Lack of novelty and research depth:** The main technique is to share dissimilar KV caches for efficient inference, which is quite simple. Although the authors claim that this originates from a counterintuitive observation, no motivation is provided in the methodology section. Therefore, neither the novelty nor the research depth of this paper meets the bar for a top AI conference. 2. **Unreasonable observation without further analysis:** The observation that the sharing the
* Does not require training * Provides an interesting and novel insight that sharing dissimilar KV caches yields better performance. * Offers diverse and insightful evaluation results.
* Results show a noticeable performance drop even at low compression rates (e.g., 12.5%, 25%), which may limit the practicality of the method. * Lacks an explanation for why sharing dissimilar KV caches yields better performance, leaving an essential aspect of the method's effectiveness rather unclear.
This idea offers new insights into how memory size can be further reduced, potentially leading to more efficient model deployments and optimized hardware utilization.
1) The paper lacks a comparison with other cache-sharing methods, which would provide a clearer understanding of its advantages. 2) It should consider the scenario when the KV cache is quantized, as quantization is often used during inference to save energy. 3) The paper also lacks a scalability analysis, which is crucial for evaluating how well the proposed method performs as model size and complexity increase.
Taxonomy
Topics: Caching and Content Delivery · Advanced Wireless Network Optimization · Advanced Data Compression Techniques
Methods: Linear Layer · Dense Connections · Multi-Head Attention · Adam · Softmax · Dropout · Absolute Position Encodings · Label Smoothing · Byte Pair Encoding · Layer Normalization
