Linear attention is (maybe) all you need (to understand transformer optimization)
Kwangjun Ahn, Xiang Cheng, Minhak Song, Chulhee Yun, Ali Jadbabaie, Suvrit Sra

TL;DR
This paper studies Transformer training dynamics through a shallow linearized Transformer trained on regression tasks. It finds that this simple model reproduces several prominent aspects of full Transformer training, which makes it a useful abstraction for understanding optimization.
Contribution
The paper demonstrates that a simple linearized Transformer model can replicate important training behaviors, providing a new approach to understanding Transformer optimization.
Findings
Linearized models reproduce key training dynamics of Transformers.
Simplified models can serve as realistic abstractions for optimization analysis.
Results suggest linearized Transformers are valuable for understanding training processes.
Abstract
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.
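The regression setup mentioned in the abstract can be illustrated with a small example. This is a minimal sketch, assuming a single softmax-free attention layer applied to a prompt matrix whose columns stack an input vector and its label. The function names (make_prompt, linear_attention, predict), the matrices P, Q, M, the 1/n scaling, and the choice of Gaussian data are assumptions made for illustration and are not taken from the text above. With the hand-picked weights shown in the usage note, the layer's prediction coincides with one gradient-descent step on the in-context least-squares loss.

```python
import numpy as np

def make_prompt(rng, n=20, d=5):
    """Build one in-context regression prompt (assumed layout).

    Columns 0..n-1 hold (x_i, y_i) pairs; column n holds the query x
    with its label slot left at 0. Returns Z of shape (d+1, n+1) and
    the true query label.
    """
    w = rng.standard_normal(d)            # hidden regression vector
    X = rng.standard_normal((n + 1, d))   # n context inputs + 1 query
    y = X @ w
    Z = np.zeros((d + 1, n + 1))
    Z[:d, :] = X.T
    Z[d, :n] = y[:n]                      # query label is masked out
    return Z, y[n]

def linear_attention(Z, P, Q):
    """One softmax-free attention layer: Z + P Z M (Z^T Q Z) / n."""
    n = Z.shape[1] - 1
    M = np.eye(n + 1)
    M[n, n] = 0.0                         # query column is not used as a value
    return Z + P @ Z @ M @ (Z.T @ Q @ Z) / n

def predict(Z, P, Q):
    """Read the prediction from the query's label slot (assumed convention)."""
    return linear_attention(Z, P, Q)[-1, -1]
```

As a usage example under the same assumptions, setting P to zero except for a 1 in its bottom-right entry and Q to the identity on the first d coordinates makes predict(Z, P, Q) equal to (1/n) * sum_i y_i * (x_i . x_query), the prediction after one gradient step from zero with unit step size. In a training experiment P and Q would instead be learned by an optimizer.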
Taxonomy
Topics: Energy Load and Power Forecasting
Methods: Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Dropout · Softmax · Residual Connection
