KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity
Damien Lesens, Beheshteh T. Rakhshan, Guillaume Rabusseau

TL;DR
KQ-SVD offers a provably optimal low-rank approximation of the attention matrix in transformers, significantly reducing memory usage while maintaining high fidelity of attention outputs.
Contribution
This paper introduces KQ-SVD, a novel method that directly decomposes the attention matrix to improve compression fidelity with theoretical guarantees.
Findings
Outperforms prior methods in preserving attention accuracy
Reduces the KV cache memory bottleneck in large language models
Demonstrates superior results on LLaMA and Mistral models
Abstract
The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and…
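The abstract's central claim, that compressing keys alone is suboptimal compared with decomposing the key-query inner products directly, can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes that "attention matrix" refers to the pre-softmax score matrix K Qᵀ, uses random synthetic matrices in place of real model activations, and the names `truncated_svd`, `approx_k`, and `approx_kq` are hypothetical. By the Eckart-Young theorem, the truncated SVD of K Qᵀ is the best rank-r approximation in Frobenius norm, so its error can never exceed that of a rank-r approximation built from the keys alone.

```python
import numpy as np


def truncated_svd(M, r):
    """Best rank-r approximation of M in Frobenius norm (Eckart-Young)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


# Synthetic stand-ins for one attention head (assumption: not real model data).
rng = np.random.default_rng(0)
n, m, d, r = 64, 64, 16, 4  # cached keys, queries, head dimension, target rank
K = rng.standard_normal((n, d))
# Scale query dimensions unevenly so keys and queries emphasize different directions.
Q = rng.standard_normal((m, d)) * np.linspace(1.0, 0.05, d)

scores = K @ Q.T  # pre-softmax attention scores, shape (n, m)

# Baseline: low-rank approximation of the keys alone, then inner products.
approx_k = truncated_svd(K, r) @ Q.T

# Direct approach: low-rank approximation of the score matrix itself.
approx_kq = truncated_svd(scores, r)

err_k = np.linalg.norm(scores - approx_k)
err_kq = np.linalg.norm(scores - approx_kq)
print(f"keys-only rank-{r} error: {err_k:.4f}")
print(f"direct rank-{r} error:    {err_kq:.4f}")
```

Both approximations have rank at most r, so the comparison is at equal compression budget. In practice one would not materialize the full score matrix for long sequences; the sketch only demonstrates the optimality gap that motivates the method.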
Taxonomy
Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Advanced Neural Network Applications
