Your Transformer is Secretly Linear
Anton Razzhigaev, Matvey Mikhalchuk, Elizaveta Goncharova, Nikolai Gerasimenko, Ivan Oseledets, Denis Dimitrov, Andrey Kuznetsov

TL;DR
This paper uncovers a surprising linearity in transformer decoders, demonstrating that certain linear approximations do not impair performance and proposing a regularization method to reduce linearity, thereby challenging current understanding of transformer operations.
Contribution
The study reveals a novel linear characteristic in transformer layers and introduces a regularization technique that reduces linearity while improving benchmark performance.
Findings
Embeddings at sequential transformer layers are related almost linearly (Procrustes similarity score of 0.99).
Removing or linearly approximating some of the most linear blocks does not significantly affect the loss or model performance.
Regularization based on cosine similarity reduces linearity and enhances benchmark results.
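The first finding can be made concrete with a small sketch of a Procrustes-style linearity score. The summary does not give the exact formula, so this is an assumption: both embedding sets are centered and scaled to unit Frobenius norm, the best linear map between them is found by least squares, and the score is one minus the squared residual. The function name `linearity_score` and the toy data are hypothetical.

```python
import numpy as np


def linearity_score(x, y):
    """Score in [0, 1]; 1 means y is an exact linear function of x.

    x, y: arrays of shape (n_tokens, dim), embeddings of the same tokens
    taken at two consecutive layers (assumed layout, not from the paper).
    """
    # Center each set and scale it to unit Frobenius norm.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # Least-squares linear map A minimizing ||x @ A - y||_F.
    a, *_ = np.linalg.lstsq(x, y, rcond=None)
    residual = np.linalg.norm(x @ a - y) ** 2
    return 1.0 - residual


# Toy data: one exactly linear "layer" and one strongly nonlinear one.
rng = np.random.default_rng(0)
x = rng.normal(size=(512, 16))
w = rng.normal(size=(16, 16))
score_linear = linearity_score(x, x @ w)
score_nonlinear = linearity_score(x, np.tanh(3.0 * (x @ w)))
print(score_linear, score_nonlinear)
```

The same least-squares map `a` is also what a linear replacement for a block would use in this sketch: the block's output is approximated by `x @ a` instead of running the full layer.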
Abstract
This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization, aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the…
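The abstract mentions a cosine-similarity-based regularization without giving its form here. As a rough sketch under stated assumptions, one plausible version penalizes `1 - cos` between the embeddings of consecutive layers, averaged over tokens and summed over layer pairs. The function name `cosine_regularizer`, the `weight` parameter, and the input layout are hypothetical and may differ from the authors' actual loss.

```python
import numpy as np


def cosine_regularizer(hidden_states, weight=1.0, eps=1e-8):
    """Assumed penalty: weight * sum over layer pairs of mean(1 - cos).

    hidden_states: list of arrays of shape (n_tokens, dim), one per layer,
    all describing the same tokens (hypothetical layout).
    """
    penalty = 0.0
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        # Per-token cosine similarity between consecutive layers.
        dot = np.sum(prev * curr, axis=-1)
        norms = np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1)
        cos = dot / (norms + eps)
        penalty += np.mean(1.0 - cos)
    return weight * penalty


rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
same = cosine_regularizer([h, h, h])   # identical layers: penalty near 0
flipped = cosine_regularizer([h, -h])  # opposite directions: penalty near 2
print(same, flipped)
```

In an actual pretraining run a term like this would be computed in an autodiff framework and added to the language-modeling loss; NumPy is used here only to show the arithmetic.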
Taxonomy
Topics: Physics and Engineering Research Articles
Methods: Dense Connections · Cosine Annealing · Linear Layer · Weight Decay · Linear Warmup With Cosine Annealing · Residual Connection · Byte Pair Encoding · Adam · Dropout
