Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

Marko Karbevski

arXiv:2603.13381 · cs.LG · April 27, 2026

TL;DR

This paper proposes replacing the linear query projection in transformers with a nonlinear residual form, $Q(X) = X + f_{\theta}(X)$. On GPT-3 small style models the change consistently improves on the linear baseline, which motivates investigation at larger scales.

Contribution

The paper introduces a nonlinear residual form for the query projection in transformers and reports consistent gains over the linear baseline in small-scale experiments.

Findings

01

Nonlinear residual queries lower validation log-loss by 2.40% relative to the linear baseline.

02

Perplexity decreases by 6.81% with the proposed method.

03

Outperforms a model with 12.5% more non-embedding parameters.
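The two relative improvements above can be cross-checked against each other. The sketch below assumes that perplexity is the exponential of the validation log-loss measured in nats per token, and that both percentages are relative to the same baseline. Under those assumptions, the pair of figures pins down an implied baseline loss. That value is an inference from the reported percentages, not a number stated in the source, and the function name is hypothetical.

```python
import math


def implied_baseline_loss(rel_loss_drop, rel_ppl_drop):
    """Solve exp(-rel_loss_drop * L) = 1 - rel_ppl_drop for L.

    Assumes perplexity = exp(L), with L in nats per token, and that both
    drops are measured relative to the same baseline (an assumption).
    """
    return -math.log(1.0 - rel_ppl_drop) / rel_loss_drop


# Reported relative improvements: 2.40% log-loss, 6.81% perplexity.
baseline = implied_baseline_loss(0.0240, 0.0681)
improved = baseline * (1.0 - 0.0240)

print(f"implied baseline log-loss: {baseline:.2f} nats per token")
print(f"perplexity ratio improved/baseline: {math.exp(improved - baseline):.4f}")
```

If the assumptions hold, the two findings are consistent with a baseline validation loss of a little under 3 nats per token. A very different true baseline would mean the two percentages are measured on different scales or different baselines.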

Abstract

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_{Q}$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $X W_{Q}$, $X W_{K}$, $X W_{V}$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_{Q} \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_{\theta}(X)$, where $f_{\theta}$ is a bottleneck MLP with $d^{2} + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and…
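A minimal sketch of the residual query $Q(X) = X + f_{\theta}(X)$ may help make the parameter count concrete. The abstract does not specify the hidden width, activation, biases, or initialisation of $f_{\theta}$, so everything below is an assumption. A hidden width of $d/2$ is chosen because it gives $2 \cdot d \cdot (d/2) = d^{2}$ weights plus $O(d)$ biases. GELU is used as the activation, and the output layer is zero-initialised so that $Q(X) = X$ at the start. All function names are hypothetical, and pure Python lists are used only to keep the example dependency-free.

```python
import math
import random


def init_bottleneck_query(d, hidden=None, seed=0):
    """Parameters of f_theta: d -> hidden -> d, with biases.

    hidden = d // 2 is an assumption: for even d it yields d^2 weights
    plus 1.5 * d bias terms, i.e. d^2 + O(d) parameters in total.
    """
    rng = random.Random(seed)
    h = d // 2 if hidden is None else hidden
    scale = 1.0 / math.sqrt(d)
    return {
        "W1": [[rng.gauss(0.0, scale) for _ in range(h)] for _ in range(d)],
        "b1": [0.0] * h,
        # Assumed zero init of the output layer, so f_theta(X) = 0 initially.
        "W2": [[0.0] * d for _ in range(h)],
        "b2": [0.0] * d,
    }


def gelu(x):
    # Exact GELU; the choice of activation is an assumption.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def affine(x, W, b):
    # x has len(W) entries; W is a len(W) x len(b) matrix.
    return [sum(x[i] * W[i][j] for i in range(len(W))) + b[j]
            for j in range(len(b))]


def nonlinear_query(X, params):
    """Q(X) = X + f_theta(X), applied to each token vector in X."""
    out = []
    for x in X:
        hid = [gelu(v) for v in affine(x, params["W1"], params["b1"])]
        f = affine(hid, params["W2"], params["b2"])
        out.append([xi + fi for xi, fi in zip(x, f)])
    return out


def count_params(params):
    total = 0
    for value in params.values():
        if value and isinstance(value[0], list):
            total += sum(len(row) for row in value)  # weight matrix
        else:
            total += len(value)  # bias vector
    return total
```

With these assumed choices, `count_params(init_bottleneck_query(d))` equals `d * d + 3 * d // 2` for even `d`, against `d * d` for a bias-free linear $W_{Q}$. At initialisation, `nonlinear_query` returns its input unchanged, which is one way to read the abstract's statement that the identity term anchors the nonlinearity to a known-good prior.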
