Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Self-Attention Transformers
Marko Karbevski, Antonij Mijoski

TL;DR
This paper shows that in self-attention transformers, one of the Query, Key, or Value weights can be redundant, leading to simpler models with fewer parameters without performance loss.
Contribution
It proves, under mild assumptions, that one weight in the Query, Key, Value triplet is redundant, and validates Query weight removal on small decoder-only GPT-style models trained from scratch, cutting attention parameters by 25%.
Findings
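Removing the Query weight preserves baseline performance in small GPT-style models trained from scratch.
Reduced models outperform baselines when the saved parameters are reallocated.
When the Query or Key weight is removed, attention logits depend on a single learned weight matrix rather than on a product of two.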
Abstract
We theoretically investigate whether the Query, Key, Value weight triplet can be reduced in encoder-only and decoder-only transformers. Under mild assumptions, we prove that one of the Query, Key or Value weights is redundant and can be replaced with the identity matrix, reducing attention parameters by 25%. If applied to the Query or Key weights, this also simplifies optimization: attention logits depend on a single learned weight matrix rather than on a product of two. Validating the Query weight removal on decoder-only GPT-style small models trained from scratch, we find that reduced models match baseline performance despite fewer parameters, and outperform baselines when saved parameters are reallocated. Our analysis has also led us to a structural expressivity boundary: in the mathematically tractable ReLU setting, skip connections push MLPs into a generically disjoint function…
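The claim that attention logits can depend on a single learned matrix can be illustrated with a minimal NumPy sketch. It is not the paper's proof or code: it only checks the single-layer algebraic identity for one head with square weights, where replacing the Query weight by the identity and folding it into the Key weight leaves the logits unchanged. The names `attention_logits` and `w_k_merged`, the sizes, and the assumption that an attention block holds four equally sized matrices (Query, Key, Value, and an output projection) are illustrative choices, not details taken from the paper.

```python
import numpy as np


def attention_logits(x, w_q, w_k):
    """Unscaled single-head attention logits (x W_Q)(x W_K)^T."""
    return (x @ w_q) @ (x @ w_k).T


rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8  # hypothetical sizes, chosen only for illustration

x = rng.standard_normal((n_tokens, d_model))
w_q = rng.standard_normal((d_model, d_model))
w_k = rng.standard_normal((d_model, d_model))

# Baseline: logits depend on the product W_Q W_K^T of two learned matrices.
baseline = attention_logits(x, w_q, w_k)

# Reduced: W_Q is replaced by the identity and absorbed into one Key matrix.
w_k_merged = w_k @ w_q.T
reduced = attention_logits(x, np.eye(d_model), w_k_merged)

logits_match = np.allclose(baseline, reduced)

# Parameter count, assuming four d_model x d_model matrices per attention block
# (Query, Key, Value, output projection); dropping one leaves three.
full_params = 4 * d_model**2
reduced_params = 3 * d_model**2
saving = 1 - reduced_params / full_params

print(logits_match, saving)
```

The sketch covers only the logits of an isolated layer. The paper's result concerns the full transformer under its stated assumptions, which this check does not reproduce.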
