Linear attention is (maybe) all you need (to understand transformer optimization)

Kwangjun Ahn; Xiang Cheng; Minhak Song; Chulhee Yun; Ali Jadbabaie; Suvrit Sra

arXiv:2310.01082 · cs.LG · March 14, 2024 · 1 citation

TL;DR

This paper studies Transformer training dynamics through a simple, shallow linearized Transformer trained on regression tasks. It finds that this model reproduces several prominent aspects of full Transformer training, which makes it a useful abstraction for understanding optimization.

Contribution

The paper demonstrates that a simple linearized Transformer model can replicate important training behaviors, providing a new approach to understanding Transformer optimization.

Findings

01 Linearized models reproduce key training dynamics of Transformers.

02 Simplified models can serve as realistic abstractions for optimization analysis.

03 Results suggest linearized Transformers are valuable for understanding training processes.

Abstract

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized shallow Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023) and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.
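
To make the setup in the abstract concrete, here is a minimal sketch of a single linear attention layer applied to an in-context linear regression prompt. It is not the authors' code. The prompt layout, the function names `build_prompt` and `linear_attention_layer`, the 1/n scaling, and the hand-set weight matrices `P` and `Q` are all assumptions chosen for illustration, loosely following the formulation used in the works the abstract cites. The weights are fixed rather than trained. With this particular choice, the layer's prediction for the query coincides with one gradient descent step (step size 1, starting from zero) on the least-squares loss, which the final check confirms numerically.

```python
import numpy as np


def build_prompt(X, y, x_query):
    """Stack n labeled examples and one unlabeled query into a (d+1) x (n+1) matrix.

    Column i holds (x_i, y_i); the last column holds (x_query, 0).
    This layout is an assumption made for this sketch.
    """
    top = np.column_stack([X.T, x_query])      # d x (n+1)
    bottom = np.append(y, 0.0)[None, :]        # 1 x (n+1), query label slot is 0
    return np.vstack([top, bottom])


def linear_attention_layer(Z, P, Q):
    """One residual linear attention layer: Z + (1/n) * P Z M (Z^T Q Z).

    There is no softmax. M zeroes out the query column so that only the
    n labeled examples contribute to the update.
    """
    n = Z.shape[1] - 1
    M = np.eye(n + 1)
    M[n, n] = 0.0
    return Z + (P @ Z @ M @ (Z.T @ Q @ Z)) / n


rng = np.random.default_rng(0)
d, n = 5, 20
X = rng.standard_normal((n, d))        # in-context inputs
w_star = rng.standard_normal(d)        # hidden regression vector
y = X @ w_star                         # noiseless labels
x_query = rng.standard_normal(d)

Z = build_prompt(X, y, x_query)

# Hand-set (not trained) weights, assumed for illustration:
# P reads out the label row, Q compares the covariate parts of two tokens.
P = np.zeros((d + 1, d + 1))
P[d, d] = 1.0
Q = np.zeros((d + 1, d + 1))
Q[:d, :d] = np.eye(d)

Z_out = linear_attention_layer(Z, P, Q)
attn_pred = Z_out[d, n]                # bottom-right entry is the query prediction

# One gradient descent step from w = 0 on (1/2n) * sum_i (w^T x_i - y_i)^2
w_gd = (X.T @ y) / n
gd_pred = w_gd @ x_query

print(np.isclose(attn_pred, gd_pred))
```

In the setting the abstract describes, `P` and `Q` would be trainable parameters fitted over many random regression prompts rather than fixed by hand as they are here.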

Code & Models

Repositories

chengxiang/lineartransformer (PyTorch)

Videos

Linear attention is (maybe) all you need (to understand Transformer optimization) · SlidesLive

Taxonomy

Topics

Energy Load and Power Forecasting

Methods

Attention Is All You Need · Linear Layer · Multi-Head Attention · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Dropout · Softmax · Residual Connection