Stabilizing Transformer Training by Preventing Attention Entropy Collapse
Shuangfei Zhai, Tatiana Likhomanenko, Etai Littwin, Dan Busbridge, Jason Ramapuram, Yizhe Zhang, Jiatao Gu, Josh Susskind

TL;DR
This paper identifies attention entropy collapse as a key factor in Transformer training instability and proposes σReparam, a spectral normalization-based method, to stabilize training across various tasks and architectures.
Contribution
The paper introduces σReparam, a spectral normalization technique that prevents attention entropy collapse, enhancing training stability and enabling competitive performance without common training heuristics.
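The reparametrization can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the form described in the abstract, where a weight matrix is divided by its spectral norm (largest singular value) and multiplied by a learned scalar. The names `spectral_norm`, `sigma_reparam`, and `gamma` are hypothetical, and power iteration is one common way to estimate the spectral norm, chosen here for illustration.

```python
import numpy as np

def spectral_norm(W, n_iters=50, seed=0, eps=1e-12):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v) + eps
    for _ in range(n_iters):
        u = W @ v                      # approximate left singular vector
        u /= np.linalg.norm(u) + eps
        v = W.T @ u                    # approximate right singular vector
        v /= np.linalg.norm(v) + eps
    return float(u @ W @ v)

def sigma_reparam(W, gamma=1.0, n_iters=50):
    """Return (gamma / sigma(W)) * W, so the result has spectral norm ~gamma.

    In training, gamma would be a learnable parameter (assumption of this
    sketch: one scalar per weight matrix); here it is a plain float.
    """
    return (gamma / spectral_norm(W, n_iters)) * W

# Usage: rescale a weight matrix so its spectral norm equals gamma.
W = np.diag([3.0, 1.0, 0.5])
W_hat = sigma_reparam(W, gamma=2.0)
print(np.linalg.norm(W_hat, 2))  # exact spectral norm, for comparison
```

Decoupling the spectral norm from the direction of the weights means the scale of each linear layer is controlled by a single scalar, which is what limits how large the attention logits can grow.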
Findings
σReparam prevents attention entropy collapse across tasks
Enables training of Transformers without warmup or weight decay
Improves stability and robustness of Transformer training
Abstract
Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as attention entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution where we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that σReparam successfully prevents…
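The quantity tracked in the abstract, attention entropy per head, can be illustrated with a short NumPy sketch. It assumes standard scaled dot-product attention and averages the row-wise entropy over query positions; the averaging choice and the function names `softmax` and `attention_entropy` are assumptions of this example rather than details taken from the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(q, k, eps=1e-12):
    """Mean entropy of the attention rows for a single head.

    q: (num_queries, d) query vectors, k: (num_keys, d) key vectors.
    Each row of softmax(q k^T / sqrt(d)) is a distribution over keys;
    its entropy is low when attention concentrates on few keys.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    p = softmax(logits, axis=-1)
    row_entropy = -(p * np.log(p + eps)).sum(axis=-1)
    return float(row_entropy.mean())

# Usage: uniform attention has maximal entropy, log(num_keys);
# very large logits push the entropy toward zero.
rng = np.random.default_rng(0)
k = rng.standard_normal((6, 8))
print(attention_entropy(np.zeros((4, 8)), k))   # uniform attention
print(attention_entropy(100.0 * k, 100.0 * k))  # highly concentrated
```

Logging this value per head during training is one way to spot the pattern the abstract describes, where entropy falling toward zero coincides with an oscillating or diverging loss.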
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing
Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Vision Transformer · Absolute Position Encodings · Linear Layer · Label Smoothing · Dropout · Adam
