Beyond Linearity in Attention Projections: The Case for Nonlinear Queries
Marko Karbevski

TL;DR
This paper proposes replacing the linear query projection in transformers with a nonlinear residual form, which improves performance in GPT-3 small style models and motivates investigation at larger scales.
Contribution
Introducing a nonlinear residual approach for query projections in transformers, demonstrating consistent performance gains over linear baselines in small-scale experiments.
Findings
Nonlinear residual queries lower validation log-loss by 2.40% relative to the linear baseline.
Perplexity decreases by 6.81% with the proposed method.
Outperforms a larger model with 12.5% more non-embedding parameters.
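As a rough consistency check on the two percentages above, the following sketch assumes that both are relative reductions measured on the same validation set and that perplexity is the exponential of the log-loss in nats. Neither assumption is stated in the summary, and the variable names are illustrative. Under those assumptions the two figures together imply a baseline loss of a little under 3 nats.

```python
import math

# Relative improvements reported in the findings.
loss_drop = 0.0240  # validation log-loss lowered by 2.40%
ppl_drop = 0.0681   # perplexity lowered by 6.81%

# Assumption: perplexity = exp(loss), with loss in nats. Then
#   ppl_new / ppl_old = exp(-loss_drop * baseline_loss) = 1 - ppl_drop
# which can be solved for the baseline loss.
implied_baseline_loss = -math.log(1.0 - ppl_drop) / loss_drop
implied_new_loss = implied_baseline_loss * (1.0 - loss_drop)

print(f"implied baseline loss: {implied_baseline_loss:.2f} nats")
print(f"implied improved loss: {implied_new_loss:.2f} nats")
```

A small relative change in log-loss maps to a larger relative change in perplexity because the loss sits in the exponent, which is why the two reported percentages differ.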
Abstract
Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection W_Q may be set to identity without noticeable performance deterioration. This is possible because attention depends on the input only through its products with the projection matrices, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace W_Q with a nonlinear residual form that combines an identity term with a bottleneck MLP. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline (2.40% lower validation log-loss, 6.81% lower perplexity), comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and…
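The two ideas in the abstract can be illustrated with a small NumPy sketch. It is not the paper's implementation. Part (1) uses a single-head simplification with square projections to show how a linear query projection can be folded into the key projection without changing the attention scores. Part (2) assumes the residual form Q(X) = X + f(X), with f a two-layer bottleneck MLP using a GELU activation and a zero-initialised output layer. The exact form, activation, initialisation and sizes are not given in the text above, so all of them are assumptions, as are the names `residual_query`, `W_down` and `W_up`.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, r = 5, 8, 4  # tokens, model width, bottleneck width (hypothetical sizes)

X = rng.normal(size=(n, d))
W_Q = rng.normal(size=(d, d))
W_K = rng.normal(size=(d, d))

# (1) Single-head simplification: scores = (X W_Q)(X W_K)^T = X W_Q W_K^T X^T.
# Folding W_Q into the key projection leaves the scores unchanged,
# so the query projection itself can be the identity.
scores_linear = (X @ W_Q) @ (X @ W_K).T
W_K_folded = W_K @ W_Q.T
scores_identity = X @ (X @ W_K_folded).T
assert np.allclose(scores_linear, scores_identity)


def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def residual_query(X, W_down, W_up):
    # (2) Assumed form: identity term plus a bottleneck MLP.
    return X + gelu(X @ W_down) @ W_up


W_down = rng.normal(size=(d, r)) / np.sqrt(d)

# With a zero-initialised output layer (an assumption), the query starts
# exactly at the identity prior.
Q0 = residual_query(X, W_down, np.zeros((r, d)))
assert np.allclose(Q0, X)

# Once the output layer is nonzero, the query departs from the identity.
W_up = 0.1 * rng.normal(size=(r, d))
Q1 = residual_query(X, W_down, W_up)
print("max deviation from identity:", np.abs(Q1 - X).max())
```

In the multi-head case the folding step is more involved than this single-head check suggests, which is presumably why the abstract speaks of transformations being absorbed by adjacent layers and propagated through the network.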
