Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Hengshuai Yao; Xing Chen; Ahmed Murtadha; Guan Wang

arXiv:2603.04427·cs.LG·March 31, 2026

TL;DR

This paper introduces factored keys, a low-dimensional attention selection method that shrinks the KV cache of pretrained Transformers, reporting 75% key cache savings with less than 2% quality loss on GPT-2 and Mistral-7B.

Contribution

It factorizes each key projection with truncated SVD to compress the KV cache of an already pretrained model without retraining from scratch, unlike GQA and MLA, which must be designed into the architecture before pretraining.

Findings

01

Matching full-attention perplexity with 12% fewer parameters and 8% faster training.

02

Achieving 75% key cache savings with less than 2% quality loss on GPT-2 and Mistral-7B.

03

Enabling 60% more concurrent users by reducing cache size for large models.
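
The cache figures above can be related by simple arithmetic. The sketch below is illustrative only and rests on assumptions not stated on this page: keys and values occupy equal shares of the KV cache before compression, serving capacity is limited only by KV cache memory, and the thin key width is a quarter of the original (the hypothetical ratio `r_over_d = 0.25`, chosen to reproduce the 75% figure). Whether the paper derives its numbers this way is not confirmed here.

```python
def key_cache_savings(r_over_d: float) -> float:
    """Fraction of key cache saved when each key shrinks from d to r dims."""
    return 1.0 - r_over_d


def total_kv_savings(r_over_d: float) -> float:
    """Fraction of the whole KV cache saved.

    Assumption: keys and values are the same size before compression,
    so keys make up half of the cache and values are left untouched.
    """
    return 0.5 * key_cache_savings(r_over_d)


def extra_capacity(r_over_d: float) -> float:
    """Relative gain in sequences that fit in a fixed KV memory budget.

    Assumption: capacity is limited only by KV cache memory.
    """
    return 1.0 / (1.0 - total_kv_savings(r_over_d)) - 1.0


r_over_d = 0.25  # hypothetical ratio r/d, not taken from the paper
print(f"key cache saved:  {key_cache_savings(r_over_d):.1%}")
print(f"total KV saved:   {total_kv_savings(r_over_d):.1%}")
print(f"extra capacity:   {extra_capacity(r_over_d):.1%}")
```

Under these assumptions, saving three quarters of the key cache removes 37.5% of the total KV cache, which allows roughly 1.6 times as many cached sequences in the same memory.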

Abstract

Standard Transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only $O(\log N)$ dimensions to distinguish among $N$ relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), which must be designed into the architecture before pretraining. We factorize each key projection $W_{K} \approx A_{d \times r} B_{r \times d}$ via truncated singular value decomposition (SVD) (where…
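
The factorization described in the abstract can be sketched in NumPy. This is a minimal illustration, not the authors' implementation, and it makes several assumptions because the abstract is truncated here: a row-vector convention (keys are `X @ W_K`), a single head, no positional rotation, the thin key `X @ A` as the cached quantity, and `B` folded into the query side. The helper names `factor_key_projection` and `thin_scores` are hypothetical.

```python
import numpy as np


def factor_key_projection(W_K: np.ndarray, r: int):
    """Rank-r truncated SVD of W_K, returning A (d x r) and B (r x d)."""
    U, S, Vt = np.linalg.svd(W_K, full_matrices=False)
    A = U[:, :r] * S[:r]  # scale the top-r left singular vectors
    B = Vt[:r, :]         # top-r right singular vectors
    return A, B


def thin_scores(X, W_Q, A, B):
    """Unnormalized attention scores computed from r-dimensional keys.

    Assumption: only thin_keys (n x r) would be cached; B is absorbed
    into the query projection, so full-width keys are never formed.
    """
    thin_keys = X @ A                  # (n, r)
    thin_queries = (X @ W_Q) @ B.T     # (n, r)
    return thin_queries @ thin_keys.T, thin_keys


rng = np.random.default_rng(0)
d, r, n = 16, 4, 5                     # toy sizes, chosen for illustration
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
X = rng.standard_normal((n, d))

A, B = factor_key_projection(W_K, r)
scores_thin, thin_keys = thin_scores(X, W_Q, A, B)

# Reference: scores using the full-width low-rank keys X @ A @ B.
scores_lowrank = (X @ W_Q) @ (X @ A @ B).T
print(thin_keys.shape, np.allclose(scores_thin, scores_lowrank))
```

The thin-key scores equal the scores obtained with the rank-r approximation of `W_K`, so any quality loss in this sketch comes only from the truncation itself, not from caching fewer dimensions.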
