Stabilizing Transformer Training by Preventing Attention Entropy Collapse

Shuangfei Zhai; Tatiana Likhomanenko; Etai Littwin; Dan Busbridge; Jason Ramapuram; Yizhe Zhang; Jiatao Gu; Josh Susskind

arXiv:2303.06296·cs.LG·July 26, 2023·6 cites

TL;DR

This paper identifies attention entropy collapse as a key factor in Transformer training instability and proposes σReparam, a method based on spectral normalization, to stabilize training across tasks and architectures.

Contribution

The paper introduces σReparam, a spectral normalization technique that prevents attention entropy collapse, improves training stability, and reaches competitive performance without common training heuristics such as warmup or weight decay.
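As a rough illustration of the reparametrization this contribution refers to, the sketch below rescales a weight matrix W to (γ / σ(W)) · W, where σ(W) is the spectral norm of W and γ is the scalar that would be learned during training. It is a plain-Python sketch, not the code from the authors' repository: the names `spectral_norm` and `sigma_reparam`, the fixed iteration count, and the seeded random start are assumptions made for the example, and γ is passed in as an ordinary float rather than trained.

```python
import math
import random


def spectral_norm(W, n_iters=50, seed=0):
    """Estimate the largest singular value of W (a list of rows) by power iteration."""
    n_rows, n_cols = len(W), len(W[0])
    rng = random.Random(seed)  # seeded start vector (an assumption of this sketch)
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    norm_v = math.sqrt(sum(x * x for x in v))
    v = [x / norm_v for x in v]
    sigma = 0.0
    for _ in range(n_iters):
        # u = W v; its length approximates the top singular value.
        u = [sum(w * x for w, x in zip(row, v)) for row in W]
        sigma = math.sqrt(sum(x * x for x in u))
        if sigma == 0.0:
            return 0.0
        u = [x / sigma for x in u]
        # v = W^T u, renormalized to unit length.
        v = [sum(W[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
    return sigma


def sigma_reparam(W, gamma):
    """Return (gamma / sigma(W)) * W; gamma stands in for the learned scalar."""
    sigma = spectral_norm(W)
    if sigma == 0.0:
        raise ValueError("cannot normalize a matrix with zero spectral norm")
    scale = gamma / sigma
    return [[scale * w for w in row] for row in W]
```

By construction the rescaled matrix has a spectral norm close to γ, so in this sketch the scalar alone sets how much the layer can amplify its input.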

Findings

01 σReparam prevents attention entropy collapse across tasks

02 Enables training of Transformers without warmup or weight decay

03 Improves stability and robustness of Transformer training

Abstract

Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as entropy collapse. As a remedy, we propose σReparam, a simple and efficient solution where we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that σReparam successfully prevents…
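The attention entropy tracked in the abstract can be illustrated with a small sketch. Assuming a single head's attention logits are given as a matrix with one row per query, the code below applies a row-wise softmax and averages the Shannon entropy of the rows. The function names and the choice to average over queries are assumptions for the example, not taken from the authors' code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(logits):
    """Mean Shannon entropy (in nats) of the row-wise softmax of a logit matrix."""
    entropies = []
    for row in logits:
        probs = softmax(row)
        # Skip exact zeros so that 0 * log(0) is treated as 0.
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0.0))
    return sum(entropies) / len(entropies)


# Uniform attention over 4 keys versus attention concentrated on one key.
uniform = attention_entropy([[0.0, 0.0, 0.0, 0.0]])
peaked = attention_entropy([[100.0, 0.0, 0.0, 0.0]])
```

Uniform attention over four keys gives the maximum entropy, log 4 (about 1.386 nats), while the concentrated row gives an entropy close to zero, which is the regime the abstract calls entropy collapse.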

Code & Models

Repositories

apple/ml-sigma-reparam · JAX · Official

Videos

Stabilizing Transformer Training by Preventing Attention Entropy Collapse · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Memory and Neural Computing

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Vision Transformer · Absolute Position Encodings · Linear Layer · Label Smoothing · Dropout · Adam