MatryoshkaKV: Adaptive KV Compression via Trainable Orthogonal Projection
Bokai Lin, Zihao Zeng, Zipeng Xiao, Siqi Kou, Tianqi Hou, Xiaofeng Gao, Hao Zhang, Zhijie Deng

TL;DR
This paper introduces MatryoshkaKV, a novel adaptive compression method for KV caches in large language models, using trainable orthogonal projections to reduce memory usage while maintaining high performance.
Contribution
It proposes a trainable orthogonal projection approach with a Matryoshka training strategy for adaptive KV cache compression in LLMs, addressing performance degradation at low compression rates.
Findings
Achieves over 90% performance retention at 60% compression rate.
Supports up to 75% compression in extreme scenarios.
Compatible with pre-trained LLMs like LLaMA2-7B and Mistral-7B.
Abstract
KV cache has become a de facto technique for the inference of large language models (LLMs), where tensors of shape (layer number, head number, sequence length, feature dimension) are introduced to cache historical information for self-attention. As the size of the model and data grows, the KV cache can quickly become a bottleneck within the system in both storage and memory transfer. To address this, prior studies usually focus on the first three axes of the cache tensors for compression. This paper supplements them, focusing on the feature dimension axis, by utilizing low-rank projection matrices to transform the cache features into spaces with reduced dimensions. We begin by investigating the canonical orthogonal projection method for data compression through principal component analysis (PCA). We observe the issue with PCA projection where significant performance degradation is…
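As a rough illustration of the feature-axis idea in the abstract, the following NumPy sketch fits a PCA-style orthogonal basis on the cached keys of a single head, keeps only the first `r` columns, and compares attention logits before and after compression. It is not the authors' implementation: the function names (`pca_basis`, `project`), the synthetic near-low-rank data, the sizes, and the reading of "compression rate" as `1 - r/d` are all assumptions made for the example, and the trainable projections and Matryoshka training strategy of the paper are not modeled here.

```python
import numpy as np


def pca_basis(features: np.ndarray) -> np.ndarray:
    """Return a (d, d) orthogonal matrix whose columns are the eigenvectors
    of the uncentered second-moment matrix, sorted by decreasing eigenvalue.
    (Hypothetical helper for illustration only.)"""
    second_moment = features.T @ features / features.shape[0]
    eigvals, eigvecs = np.linalg.eigh(second_moment)  # eigenvalues ascending
    order = np.argsort(eigvals)[::-1]                 # reorder to descending
    return eigvecs[:, order]


def project(x: np.ndarray, basis: np.ndarray, r: int) -> np.ndarray:
    """Project features onto the first r columns of the orthogonal basis."""
    return x @ basis[:, :r]


rng = np.random.default_rng(0)
seq_len, d, r = 256, 64, 16  # assumed sizes: keep 16 of 64 feature dims

# Synthetic keys that are nearly rank 8 (an assumption that makes the demo
# compress well; real KV features need not behave this way).
latent = rng.standard_normal((8, d))
keys = rng.standard_normal((seq_len, 8)) @ latent
keys += 0.01 * rng.standard_normal((seq_len, d))
query = rng.standard_normal((1, 8)) @ latent

U = pca_basis(keys)
keys_c = project(keys, U, r)    # cached as (seq_len, r) instead of (seq_len, d)
query_c = project(query, U, r)  # the query is projected with the same basis

scores_full = query @ keys.T        # attention logits with the full cache
scores_approx = query_c @ keys_c.T  # logits computed in the reduced space
rel_err = np.linalg.norm(scores_full - scores_approx) / np.linalg.norm(scores_full)

# Under the assumed definition, the compression rate along the feature axis.
compression_rate = 1 - r / d
print(f"relative logit error: {rel_err:.2e}, compression rate: {compression_rate:.0%}")
```

Because the columns of `U` are orthonormal, truncating to the first `r` columns amounts to an orthogonal projection of the keys, so the logit error depends only on how much key energy lies outside the retained subspace. On real model activations that residual may be substantial, which is the kind of degradation the abstract attributes to plain PCA projection.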
Peer Reviews
Decision·ICLR 2025 Poster
1. The proposed Matryoshka training strategy effectively preserves hierarchical structures in orthogonal matrices inherited from PCA at various compression levels, ensuring robust performance across dimensions.
2. Greedy search algorithm effectively adapts to differing sparsity in each $W_k$ and $W_v$ matrix, showcasing flexibility in compression rates across layers.
3. There are comprehensive MKV evaluations across cache budgets, which reveal substantial improvements, particularly under ext
1. Lack of Runtime Evaluation: The absence of runtime metrics makes it challenging to assess the practical benefits of this method fully (see Questions).
2. Missing State-of-the-Art Comparisons: Unusually, the paper doesn’t thoroughly compare to existing state-of-the-art methods. Although it mentioned the other methods may collapse under 60% cache budget (lines 126-131), a comparison with Eigen-Attention and HeadKV at different cache budgets and tasks in terms of both performance and runtime w
1. The paper is easy to follow.
2. Much stronger performance than the PCA baseline when the compression ratio is low.
1. It is unclear whether the novelty of the paper is significant.
2. The paper does not compare with the methods that compress the other dimensions. Thus, it is unclear whether the proposed method is more effective. It is also unclear whether the proposed method can be combined with the others while maintaining its effectiveness.
The paper tackles the problem of KV cache compression in LLMs from a new angle by focusing on the feature dimension. While prior work has explored compressing along the layer, head, and sequence length dimensions, this work shows that significant compression gains can also be achieved along the feature axis. This opens up a promising new direction for efficient LLM inference. The MatryoshkaKV method demonstrates impressive performance in experiments. It can compress KV caches by 60-75% on avera
The paper lacks rigorous theoretical analysis of why the proposed MatryoshkaKV method works better than PCA-based approaches. While the authors provide some error analysis in Appendix A, it is relatively brief and doesn't fully explain the theoretical underpinnings of the method's superior performance. Evaluation is limited on very long sequence tasks, where KV cache compression would be most valuable.
Taxonomy
Topics: Advanced Wireless Communication Techniques · Advanced Data Compression Techniques · Wireless Communication Networks Research
Methods: Principal Components Analysis · Focus
