Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Hengshuai Yao, Xing Chen, Ahmed Murtadha, Guan Wang

TL;DR
This paper introduces factored keys, a low-dimensional attention selection method that shrinks the key cache of pretrained Transformers by 75% with less than 2% quality loss on GPT-2 and Mistral-7B.
Contribution
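The paper factorizes each key projection with truncated SVD to shrink the KV cache of a pretrained model without retraining from scratch. Unlike Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), which must be designed into the architecture before pretraining, the method applies to existing models.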
Findings
Matching full-attention perplexity with 12% fewer parameters and 8% faster training.
Achieving 75% key cache savings with less than 2% quality loss on GPT-2 and Mistral-7B.
Enabling 60% more concurrent users by reducing cache size for large models.
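The 75% key cache saving and the 60% concurrency gain can be related with a little arithmetic. The sketch below is one reading that makes the two figures consistent, not a derivation taken from the paper. It assumes keys and values occupy equal shares of the KV cache and that the number of concurrent users is limited only by KV cache memory. The function name and the key_share parameter are hypothetical.

```python
def remaining_kv_cache_fraction(key_savings, key_share=0.5):
    """Fraction of the original KV cache left after shrinking only the keys.

    key_share is the assumed fraction of the cache held by keys
    (0.5 when keys and values have the same size).
    """
    return 1.0 - key_share * key_savings


# 75% key savings applied to half of the cache leaves 62.5% of the total.
remaining = remaining_kv_cache_fraction(0.75)

# If memory per user scales with the cache, capacity grows by 1/remaining.
capacity_gain = 1.0 / remaining - 1.0  # roughly 0.6, i.e. about 60% more users

print(f"remaining cache: {remaining:.3f}, capacity gain: {capacity_gain:.2f}")
```

Under these assumptions, cutting the key cache by 75% removes 37.5% of the total KV cache, which fits roughly 1.6 times as many users in the same memory.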
Abstract
Standard Transformer attention uses identical dimensionality for queries, keys, and values, yet these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich representations (value transfer). We show that selection requires only dimensions to distinguish among relevant token categories (e.g., syntactic roles, semantic clusters, positional patterns) -- far fewer than value transfer needs. We introduce factored keys, which exploit this asymmetry to physically shrink the KV cache of any pretrained model without retraining from scratch -- unlike Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA), which must be designed into the architecture before pretraining. We factorize each key projection via truncated singular value decomposition (SVD) (where…
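The SVD factorization described in the abstract can be sketched as follows. The abstract is truncated here, so this is a minimal illustration of one standard way to factor a key projection, and the authors' exact formulation may differ. It assumes a single head without positional encoding, and all names (factor_key_projection, W_k, W_q, A, B) and sizes are hypothetical. The key projection is split as W_k ≈ A B; only the thin keys X A are cached, and B is folded into the queries so the attention scores are unchanged up to truncation error.

```python
import numpy as np


def factor_key_projection(W_k, rank):
    """Split W_k (d_model x d_head) into A (d_model x rank) and B (rank x d_head)."""
    U, S, Vt = np.linalg.svd(W_k, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # scale the kept left vectors by their singular values
    B = Vt[:rank, :]
    return A, B


rng = np.random.default_rng(0)
d_model, d_head, rank, seq_len = 64, 16, 4, 10

# Toy weights: W_k is built to be exactly rank 4, so truncation at rank 4 is lossless.
# Real pretrained weights are not exactly low rank, so some error would remain.
W_k = rng.standard_normal((d_model, rank)) @ rng.standard_normal((rank, d_head))
W_q = rng.standard_normal((d_model, d_head))
X = rng.standard_normal((seq_len, d_model))

A, B = factor_key_projection(W_k, rank)

# Full attention logits: Q K^T with d_head-dimensional keys.
full_scores = (X @ W_q) @ (X @ W_k).T

# Thin path: cache rank-dimensional keys and project queries through B^T.
thin_keys = X @ A                 # shape (seq_len, rank): what would be cached
thin_queries = (X @ W_q) @ B.T    # shape (seq_len, rank)
thin_scores = thin_queries @ thin_keys.T

max_err = np.abs(full_scores - thin_scores).max()
print(f"cached key dims: {thin_keys.shape[1]} of {d_head}, max score error: {max_err:.2e}")
```

In this toy setup each cached key has 4 dimensions instead of 16, a 75% reduction, while the values are left at full size.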
