Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers

Marko Karbevski; Antonij Mijoski

arXiv:2510.23912 · cs.LG · April 24, 2026

TL;DR

This paper shows that, under mild assumptions, one of the Query, Key, or Value weights in a self-attention transformer is redundant and can be replaced with the identity matrix, giving simpler models with 25% fewer attention parameters and no loss in performance.

Contribution

The paper proves that one weight of the Query, Key, Value triplet is redundant under mild assumptions, and validates Query weight removal on small decoder-only GPT-style models trained from scratch.

Findings

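01 Removing the Query weight preserves baseline performance in small GPT-style models, despite fewer parameters.

02 Reduced models outperform baselines when the saved parameters are reallocated.

03 After removing the Query or Key weight, attention logits depend on a single learned weight matrix rather than on a product of two.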
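The third finding can be illustrated at the level of a single attention head. The sketch below is not the paper's proof: it assumes one head with square d_model × d_model weights, omits the usual scaling and softmax, and only checks the algebraic identity (X W_q)(X W_k)^T = X (X W_k W_q^T)^T, i.e. that the Query weight can be set to the identity if the Key weight absorbs it. All names and sizes (`W_q`, `W_k`, `W_k_merged`, `n_tokens`, `d_model`) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def random_matrix(rows, cols, rng):
    """Matrix with i.i.d. standard normal entries."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(a, b):
    """Largest entrywise absolute difference between two same-shape matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


rng = random.Random(0)
n_tokens, d_model = 5, 4  # hypothetical toy sizes

X = random_matrix(n_tokens, d_model, rng)    # token representations
W_q = random_matrix(d_model, d_model, rng)   # Query weight (assumed square)
W_k = random_matrix(d_model, d_model, rng)   # Key weight (assumed square)

# Baseline logits: a product of two learned matrices, (X W_q)(X W_k)^T.
baseline_logits = matmul(matmul(X, W_q), transpose(matmul(X, W_k)))

# Reduced form: Query weight is the identity, Key weight absorbs W_q.
W_k_merged = matmul(W_k, transpose(W_q))
reduced_logits = matmul(X, transpose(matmul(X, W_k_merged)))

logit_gap = max_abs_diff(baseline_logits, reduced_logits)
print(f"max |baseline - reduced| = {logit_gap:.2e}")
```

The two logit matrices agree up to floating-point rounding. The paper's result is a statement about the whole network under its own assumptions, which this single-head identity does not cover.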

Abstract

We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that one of the Query, Key or Value weights is redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that reduced models match baseline performance despite fewer parameters, and outperform baselines when saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function…
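The 25% figure is consistent with a simple parameter count. The sketch below assumes a standard attention block with four square d_model × d_model matrices (Query, Key, Value, and an output projection) and no biases; the abstract does not spell out this accounting, and the width used is a hypothetical example.

```python
def attention_params(d_model, n_square_matrices):
    """Parameter count for n square d_model x d_model matrices, no biases."""
    return n_square_matrices * d_model * d_model


d_model = 768  # hypothetical model width

baseline = attention_params(d_model, 4)  # assumed: Query, Key, Value, output projection
reduced = attention_params(d_model, 3)   # one of the triplet replaced by the identity

saving = 1 - reduced / baseline
print(f"baseline={baseline}, reduced={reduced}, saving={saving:.0%}")
```

Under this assumption the saving is independent of the width: one matrix out of four is removed.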
