Low-Rank Key Value Attention

James O'Neill; Robert Clancy; Mariia Matskevichus; Fergal Reid

arXiv:2601.11471 · cs.LG · April 9, 2026

TL;DR

The paper introduces Low-Rank Key-Value (LRKV) attention, a method that reduces memory usage in Transformers by exploiting redundancy across attention heads, while maintaining or improving performance.

Contribution

LRKV reduces KV cache memory in Transformers by combining a shared full-rank KV projection in each layer with low-rank, head-specific residuals, while remaining compute efficient.

Findings

01

LRKV consistently achieves the lowest test loss compared with standard MHA, MQA/GQA, and MLA across pretrained models from 128M to 6.3B parameters.

02

LRKV uses only 45-53% of MHA's KV cache.

03

LRKV reaches equivalent baseline quality 18-25% faster, measured in training steps.

Abstract

The key-value (KV) cache is a primary memory bottleneck in Transformers. We propose Low-Rank Key-Value (LRKV) attention, which reduces KV cache memory by exploiting redundancy across attention heads, while being compute efficient. Each layer uses a shared full-rank KV projection augmented with low-rank, head-specific residuals, providing a continuous trade-off between complete sharing and full independence. After pretraining models of size 128M to 6.3B parameters, LRKV consistently achieves the lowest test loss among standard MHA, MQA/GQA, and MLA while using only 45-53% of MHA's KV cache. LRKV reaches equivalent baseline quality 18-25% faster (measured in training steps). After supervised midtraining, LRKV achieves the highest downstream task performance across ARC-Easy, ARC-Challenge, MMLU, GSM8K, and HumanEval benchmarks.
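
The mechanism described in the abstract, a shared full-rank projection plus low-rank, head-specific residuals, can be sketched roughly as follows. This is an illustrative reading and not the authors' implementation: the factorization of each residual into a down-projection `a` and an up-projection `b`, the function names, and the cache layout assumed in `cache_ratio` (one shared vector plus one rank-sized latent per head, per token) are all assumptions introduced here. Only keys are shown; values would be handled the same way under this reading.

```python
import numpy as np

def lrkv_keys(x, w_shared, a, b):
    """Per-head keys = shared projection + low-rank head-specific residual.

    Assumed (hypothetical) shapes:
      x:        (tokens, d_model)
      w_shared: (d_model, d_head)       shared full-rank projection
      a:        (heads, d_model, rank)  per-head down-projection
      b:        (heads, rank, d_head)   per-head up-projection
    Returns an array of shape (heads, tokens, d_head).
    """
    shared = x @ w_shared                            # identical for every head
    latent = np.einsum("td,hdr->htr", x, a)          # rank-sized latent per head
    residual = np.einsum("htr,hre->hte", latent, b)  # lift back to d_head
    return shared[None, :, :] + residual

def cache_ratio(heads, d_head, rank):
    """Cached entries per token relative to MHA, ASSUMING the cache holds
    the shared vector plus one rank-sized latent per head."""
    return (d_head + heads * rank) / (heads * d_head)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_shared = rng.normal(size=(16, 4))
a = rng.normal(size=(3, 16, 2))
b = rng.normal(size=(3, 2, 4))
keys = lrkv_keys(x, w_shared, a, b)  # shape (3, 5, 4)
```

Under these assumptions, setting the residual factors to zero collapses every head onto the shared projection (complete sharing), while raising `rank` toward `d_head` lets each head diverge freely, which is one way to read the "continuous trade-off" in the abstract. With hypothetical shapes `heads=8`, `d_head=64`, `rank=24`, `cache_ratio` returns 0.5; these numbers are chosen for illustration and are not the paper's configurations.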
