PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, Wen Xiao

TL;DR
PyramidKV introduces a layer-wise dynamic KV cache compression method based on pyramidal information funneling in LLMs, significantly reducing memory while maintaining or improving long-context processing accuracy.
Contribution
This paper presents PyramidKV, a novel dynamic KV cache compression technique that allocates cache resources based on layer importance, inspired by information aggregation patterns in LLMs.
Findings
PyramidKV matches full cache performance with only 12% cache usage.
At 0.7% cache, PyramidKV outperforms other methods with up to 20.5% accuracy gain.
On the Needle-in-a-Haystack test, retaining only 128 KV cache entries suffices for LLaMA-3-70B to reach 100% accuracy.
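To put the cache sizes above in perspective, here is a rough back-of-the-envelope sketch of KV cache memory for a single sequence. The model dimensions (80 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions based on the public LLaMA-3-70B configuration, not figures from this page. The 8192-token baseline and the reading of "128 entries" as an average per layer and per KV head are also assumptions.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   entries_per_layer: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size in bytes for one sequence.

    The leading factor 2 accounts for storing one key vector and one
    value vector per retained entry.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * entries_per_layer * bytes_per_value)


# Assumed LLaMA-3-70B dimensions (grouped-query attention, fp16/bf16 cache).
LLAMA3_70B = dict(num_layers=80, num_kv_heads=8, head_dim=128)

full = kv_cache_bytes(**LLAMA3_70B, entries_per_layer=8192)  # assumed 8k context
small = kv_cache_bytes(**LLAMA3_70B, entries_per_layer=128)  # 128 retained entries

print(f"full cache:  {full / 2**20:.0f} MiB")
print(f"128 entries: {small / 2**20:.0f} MiB")
print(f"retained fraction: {small / full:.2%}")
```

The retained fraction depends only on the ratio of kept entries to context length, so longer contexts make the same 128-entry budget a proportionally smaller share of the full cache.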
Abstract
In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches…
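The layer-wise allocation described in the abstract (more cache in lower layers, less in higher ones) can be illustrated with a minimal sketch. The abstract does not state the exact schedule, so the linear decay used here, the function name `pyramid_budgets`, and the `ratio` parameter are all illustrative assumptions rather than the authors' implementation.

```python
def pyramid_budgets(num_layers: int, avg_budget: int, ratio: float = 4.0) -> list[int]:
    """Per-layer KV budgets that shrink linearly from bottom to top layer.

    avg_budget: average number of KV entries kept per layer, so the total
                matches a uniform allocation of the same size.
    ratio:      hypothetical knob, bottom-layer budget / top-layer budget.
    """
    if num_layers < 1 or avg_budget < 1 or ratio < 1:
        raise ValueError("need num_layers >= 1, avg_budget >= 1, ratio >= 1")
    if num_layers == 1:
        return [avg_budget]

    # For an arithmetic sequence, (bottom + top) / 2 equals the average.
    top = 2 * avg_budget / (1 + ratio)
    bottom = ratio * top
    step = (bottom - top) / (num_layers - 1)
    budgets = [max(1, round(bottom - i * step)) for i in range(num_layers)]

    # Rounding can shift the total slightly; absorb the drift in the
    # bottom layer so the overall budget equals the uniform baseline.
    budgets[0] += num_layers * avg_budget - sum(budgets)
    return budgets


budgets = pyramid_budgets(num_layers=32, avg_budget=128)
print("bottom layer:", budgets[0], "top layer:", budgets[-1])
print("total:", sum(budgets), "uniform total:", 32 * 128)
```

Keeping the total equal to the uniform baseline makes comparisons fair: any accuracy difference then comes from how the budget is distributed across layers, not from how much cache is kept overall.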
Peer Reviews
Decision · Submitted to ICLR 2025
1. KV cache compression is an important topic.
2. The idea of PyramidKV is explained clearly.
1. The observation that attention scores are more uniform in the first layers but become more skewed in the last layers is NOT new; see [1][2], for example. Given this observation, it is straightforward to extend existing KV cache selection methods to use different sampling ratios for different layers. This limits the novelty of the paper.
[1] InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
[2] MagicPIG: LSH Sampling for Efficient LLM Genera
1. The observation on the pyramid pattern of attention scores across layers is valuable.
2. Based on the observed pattern, the proposed method is straightforward and performant under resource-intensive circumstances.
3. The experiments are comprehensive.
1. The proposed method works really well under extreme conditions, i.e., KV cache size = 128. However, in less extreme cases, i.e., KV cache size = 2048, its performance is not comparable to other baselines according to Table 1 in the paper. Is there any explanation for this phenomenon? I think the paper deserves a small ablation section to explain it.
2. In [1], Wu et al. claim that "retrieval heads" exist across models, functioning similarly to the submission's patterns (
1) The paper analyzes attention data from different layers of LLMs and discovers that LLMs aggregate information through Pyramidal Information Funneling patterns.
2) The paper is the first to propose an algorithm that uses different KV cache compression rates at different layers, and it can be combined with other KV cache algorithms.
3) In scenarios with extremely high KV cache compression rates (such as 99.3%), this method achieves better accuracy than other existing algorithms.
1) When the KV budget is retained at 2k, the accuracy of the proposed method does not show significant advantages.
2) The paper mainly tests models with an 8k context length and lacks accuracy tests for models with sequence lengths above 128k.
3) In cases of extremely low compression ratios, it is recommended to include comparisons with newer techniques such as MInference.
Taxonomy
Topics: Algorithms and Data Compression · Caching and Content Delivery · Advanced Data Compression Techniques
