Understanding the Difficulty of Training Transformers

Liyuan Liu; Xiaodong Liu; Jianfeng Gao; Weizhu Chen; Jiawei Han

arXiv:2004.08249 · cs.LG · October 3, 2023 · 28 cites

TL;DR

This paper investigates why Transformer models are hard to train. It finds that a layer's heavy dependency on its residual branch makes training unstable, and it proposes Admin, an initialization method that improves training stability and performance.

Contribution

The paper identifies the amplification effect caused by residual dependencies as a key factor in Transformer training difficulty and introduces Admin, a new initialization method to address this issue.

Findings

01. Admin stabilizes early training stages
02. Admin accelerates convergence
03. Admin improves final model performance

Abstract

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding designing cutting-edge optimizers and learning rate schedulers carefully (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand what complicates Transformer training from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential…
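
The amplification effect described in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's code and does not implement Admin. It assumes a stack of layers of the form `LayerNorm(x + beta * f(x))`, where `f(x) = tanh(W x)` is a stand-in residual branch and `beta` is a hypothetical knob for how heavily each layer depends on that branch. All names (`output_shift`, `beta`, `noise`) are illustrative assumptions. It perturbs the weights slightly, as a parameter update would, and measures how far the output moves.

```python
import math
import random


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def branch(weight, v):
    """Toy residual branch f(x) = tanh(W x); a stand-in, not a real sublayer."""
    return [math.tanh(sum(w * a for w, a in zip(row, v))) for row in weight]


def forward(weights, x, beta):
    """Apply x <- LayerNorm(x + beta * f(x)) once per layer."""
    for weight in weights:
        f = branch(weight, x)
        x = layer_norm([a + beta * b for a, b in zip(x, f)])
    return x


def output_shift(beta, depth=6, dim=8, noise=1e-3, trials=20, seed=0):
    """Average L2 distance between outputs before and after a small
    random perturbation of every weight (hypothetical experiment)."""
    rng = random.Random(seed)  # same seed => same weights for every beta
    total = 0.0
    for _ in range(trials):
        weights = [
            [[rng.gauss(0.0, 1.0 / math.sqrt(dim)) for _ in range(dim)]
             for _ in range(dim)]
            for _ in range(depth)
        ]
        perturbed = [
            [[w + rng.gauss(0.0, noise) for w in row] for row in weight]
            for weight in weights
        ]
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        base = forward(weights, x, beta)
        moved = forward(perturbed, x, beta)
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(base, moved)))
    return total / trials


if __name__ == "__main__":
    for beta in (0.0, 0.1, 0.5, 1.0):
        print(f"beta={beta:.1f}  mean output shift={output_shift(beta):.6f}")
```

In this toy setting, a larger `beta` should produce a larger output shift for the same weight perturbation, which mirrors the instability the abstract attributes to heavy dependency. With `beta = 0` the layers ignore the branch entirely, so the output does not move at all, but the network also cannot use what the branch computes, which loosely corresponds to the abstract's remark that a light dependency limits model potential.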

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning · Machine Learning and Data Classification

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax