Your Transformer is Secretly Linear

Anton Razzhigaev; Matvey Mikhalchuk; Elizaveta Goncharova; Nikolai Gerasimenko; Ivan Oseledets; Denis Dimitrov; Andrey Kuznetsov

arXiv:2405.12250 · cs.LG · May 22, 2024

TL;DR

This paper uncovers a surprising linearity in transformer decoders, demonstrating that certain linear approximations do not impair performance and proposing a regularization method to reduce linearity, thereby challenging current understanding of transformer operations.

Contribution

The study reveals a novel linear characteristic in transformer layers and introduces a regularization technique that reduces linearity while improving benchmark performance.

Findings

01

Embeddings at sequential transformer layers are related almost linearly, with high similarity scores (Procrustes similarity of 0.99).

02

Removing or approximating linear blocks does not significantly affect model performance.

03

Regularization based on cosine similarity reduces linearity and enhances benchmark results.
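
To make the third finding concrete, here is a minimal sketch of what a cosine-similarity penalty between consecutive layers could look like. The summary does not give the formula, so the exact form (mean of 1 - cos over tokens, summed over layer pairs), the function name `cosine_regularizer`, and the weight `lam` are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def cosine_regularizer(hidden_states, lam=0.1, eps=1e-8):
    """Hypothetical penalty: lam * sum over layer pairs of mean(1 - cos).

    hidden_states: list of arrays, each of shape (n_tokens, d), holding
    the embeddings of the same tokens at successive layers.
    """
    penalty = 0.0
    for prev, curr in zip(hidden_states[:-1], hidden_states[1:]):
        # Per-token cosine similarity between consecutive layers.
        dot = np.sum(prev * curr, axis=-1)
        norms = np.linalg.norm(prev, axis=-1) * np.linalg.norm(curr, axis=-1)
        cos = dot / (norms + eps)
        penalty += np.mean(1.0 - cos)
    return lam * penalty


# Toy check on random data (not from the paper).
rng = np.random.default_rng(0)
h = rng.normal(size=(8, 4))
same_layers = cosine_regularizer([h, h, h])          # close to 0
flipped_layers = cosine_regularizer([h, -h], lam=0.5)  # close to 0.5 * 2
```

In a training loop a term like this would be added to the language-modeling loss; how it is weighted and which layers it covers would need to be taken from the paper itself.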

Abstract

This paper reveals a novel linear characteristic exclusive to transformer decoders, including models such as GPT, LLaMA, OPT, BLOOM and others. We analyze embedding transformations between sequential layers, uncovering a near-perfect linear relationship (Procrustes similarity score of 0.99). However, linearity decreases when the residual component is removed, due to a consistently low output norm of the transformer layer. Our experiments show that removing or linearly approximating some of the most linear blocks of transformers does not significantly affect the loss or model performance. Moreover, in our pretraining experiments on smaller models we introduce a cosine-similarity-based regularization aimed at reducing layer linearity. This regularization improves performance metrics on benchmarks like Tiny Stories and SuperGLUE and also successfully decreases the linearity of the…
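
The linearity measurement described in the abstract can be illustrated with a small sketch. It assumes the score is one minus the squared residual of the best least-squares linear map between centered, Frobenius-normalized embeddings; that definition, the name `linearity_score`, and the toy data are assumptions made here, and the numbers it prints are not the paper's results. The toy also mimics the residual effect: a low-norm nonlinear update added to the residual stream looks almost perfectly linear, while the update alone does not.

```python
import numpy as np


def linearity_score(x, y):
    """Return a score in [0, 1]; 1 means y is an exact linear map of x.

    x, y: arrays of shape (n_tokens, d) holding embeddings of the same
    tokens at two consecutive layers (hypothetical inputs).
    """
    # Center and rescale so the score ignores shifts and overall scale.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    x = x / np.linalg.norm(x)
    y = y / np.linalg.norm(y)
    # Least-squares linear map A minimizing ||x A - y||_F.
    a, *_ = np.linalg.lstsq(x, y, rcond=None)
    return 1.0 - np.linalg.norm(x @ a - y) ** 2


# Toy data (not from the paper): a small nonlinear "block" output
# added to a residual stream.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 16))
w = rng.normal(size=(16, 16))
block_out = 0.1 * np.tanh(x @ w)  # low-norm nonlinear update
score_with_residual = linearity_score(x, x + block_out)
score_without_residual = linearity_score(x, block_out)
print(score_with_residual, score_without_residual)
```

Because the block output has a much smaller norm than the residual stream, the combined output stays close to the input, which is one way to read the abstract's remark that linearity drops once the residual component is removed.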

Code & Models

Repositories

AIRI-Institute/LLM-Microscope
PyTorch · Official

Videos

Your Transformer is Secretly Linear · underline

Taxonomy

Topics: Physics and Engineering Research Articles

Methods: Dense Connections · Cosine Annealing · Linear Layer · Weight Decay · Linear Warmup With Cosine Annealing · Residual Connection · Byte Pair Encoding · Adam · Dropout