KQ-SVD: Compressing the KV Cache with Provable Guarantees on Attention Fidelity

Damien Lesens; Beheshteh T. Rakhshan; Guillaume Rabusseau

arXiv:2512.05916 · cs.LG · December 8, 2025

TL;DR

KQ-SVD offers a provably optimal low-rank approximation of the attention matrix in transformers, significantly reducing memory usage while maintaining high fidelity of attention outputs.

Contribution

This paper introduces KQ-SVD, a novel method that directly decomposes the attention matrix to improve compression fidelity with theoretical guarantees.

Findings

01. Outperforms prior methods in preserving attention accuracy

02. Reduces memory bottleneck in large language models

03. Demonstrates superior results on LLaMA and Mistral models

Abstract

The Key-Value (KV) cache is central to the efficiency of transformer-based large language models (LLMs), storing previously computed vectors to accelerate inference. Yet, as sequence length and batch size grow, the cache becomes a major memory bottleneck. Prior compression methods typically apply low-rank decomposition to keys alone or attempt to jointly embed queries and keys, but both approaches neglect that attention fundamentally depends on their inner products. In this work, we prove that such strategies are suboptimal for approximating the attention matrix. We introduce KQ-SVD, a simple and computationally efficient method that directly performs an optimal low-rank decomposition of the attention matrix via a closed-form solution. By targeting the true source of redundancy, KQ-SVD preserves attention outputs with higher fidelity under compression. Extensive evaluations on LLaMA and…
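
The abstract's central claim, that compressing keys alone is suboptimal for approximating the query-key inner products, can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation: it uses random toy matrices, forms the full score matrix explicitly (practical only at small sizes), and relies on the Eckart-Young theorem, which says a truncated SVD gives the best rank-r approximation in Frobenius norm. The names `truncated_svd` and `k_only_scores` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def truncated_svd(matrix, rank):
    """Best rank-`rank` approximation of `matrix` in Frobenius norm (Eckart-Young)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]


def k_only_scores(queries, keys, rank):
    """Scores obtained when only the keys are projected onto their top singular directions."""
    _, _, vt = np.linalg.svd(keys, full_matrices=False)
    basis = vt[:rank].T                      # (head_dim, rank) key subspace
    keys_lowrank = keys @ basis @ basis.T    # keys compressed without looking at queries
    return queries @ keys_lowrank.T


# Toy data (assumption): 64 tokens, head dimension 16, target rank 4.
rng = np.random.default_rng(0)
n_tokens, head_dim, rank = 64, 16, 4
# Scale query dimensions unevenly so that key variance and query-key
# interaction do not line up, which is where key-only compression loses most.
queries = rng.standard_normal((n_tokens, head_dim)) * np.linspace(0.1, 3.0, head_dim)
keys = rng.standard_normal((n_tokens, head_dim))

scores = queries @ keys.T  # pre-softmax score matrix, shape (n_tokens, n_tokens)

err_k_only = np.linalg.norm(scores - k_only_scores(queries, keys, rank))
err_kq = np.linalg.norm(scores - truncated_svd(scores, rank))

print(f"key-only SVD error:      {err_k_only:.3f}")
print(f"score-matrix SVD error:  {err_kq:.3f}")
```

Both approximations have rank at most 4, so the direct decomposition of the score matrix can never have a larger Frobenius error than the key-only projection. How KQ-SVD obtains its decomposition efficiently in closed form is described in the paper itself and is not reproduced here.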

Taxonomy

Topics: Big Data and Digital Economy · Natural Language Processing Techniques · Advanced Neural Network Applications