On Layer Normalization in the Transformer Architecture
Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, Tie-Yan Liu

TL;DR
This paper analyzes how the placement of layer normalization in Transformers affects training stability and shows that Pre-LN Transformers can be trained without a learning rate warm-up stage, reducing training time and hyper-parameter tuning.
Contribution
The paper provides a theoretical explanation for the effectiveness of layer normalization placement and demonstrates that Pre-LN Transformers can be trained without warm-up, unlike Post-LN models.
Findings
Pre-LN Transformers have well-behaved gradients at initialization.
Removing warm-up stages does not harm performance for Pre-LN models.
Pre-LN models train faster and require less hyper-parameter tuning.
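The warm-up stage mentioned in these findings can be illustrated with a minimal sketch. It assumes a plain linear warm-up that ramps the learning rate from zero to a peak value and then holds it constant; schedules used in practice usually also decay the rate afterwards. The function name `linear_warmup_lr` and the parameters `max_lr` and `warmup_steps` are hypothetical, chosen for illustration only.

```python
def linear_warmup_lr(step: int, max_lr: float, warmup_steps: int) -> float:
    """Learning rate at a given optimizer step under linear warm-up.

    Illustrative assumption: the rate grows linearly from 0 to max_lr over
    warmup_steps steps, then stays at max_lr (no decay is modeled here).
    """
    if warmup_steps <= 0:
        # No warm-up stage: use the peak learning rate from the first step,
        # which is the setting the findings report as safe for Pre-LN models.
        return max_lr
    return max_lr * min(1.0, step / warmup_steps)


# With warm-up (typical for Post-LN): the rate ramps up gradually.
with_warmup = [linear_warmup_lr(s, 1e-3, 4000) for s in (0, 2000, 4000, 8000)]

# Without warm-up (reported to work for Pre-LN): the rate is constant.
without_warmup = [linear_warmup_lr(s, 1e-3, 0) for s in (0, 2000, 4000, 8000)]
```

Setting `warmup_steps` to zero removes one hyper-parameter from the search, which is the practical saving the findings point to.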
Abstract
The Transformer is widely used in natural language processing tasks. To train a Transformer however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but will slow down the optimization and bring more hyper-parameter tunings. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the original-designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also…
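The difference between the two placements can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: `layer_norm` omits the learned gain and bias, and `toy_sublayer` is a hypothetical stand-in for a self-attention or feed-forward sub-layer.

```python
import math
from typing import Callable, List

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Post-LN: normalize after the residual addition, LN(x + F(x))."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    """Pre-LN: normalize inside the residual branch, x + F(LN(x))."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(x: Vector) -> Vector:
    """Hypothetical sub-layer used only for illustration (scales by 0.5)."""
    return [0.5 * v for v in x]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)  # output is re-normalized
pre_out = pre_ln_block(x, toy_sublayer)    # input passes through the skip path unchanged
```

In the Pre-LN form the residual path carries `x` through without normalization, so the identity connection from input to output is preserved; in the Post-LN form every block output is re-normalized.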
Code & Models
- 🤗 AIRI-Institute/gena-lm-bert-base · model · 200 dl · ♡ 29
- 🤗 AIRI-Institute/gena-lm-bert-base-t2t · model · 1.3k dl · ♡ 2
- 🤗 AIRI-Institute/gena-lm-bert-base-lastln-t2t · model · 52 dl · ♡ 1
- 🤗 AIRI-Institute/gena-lm-bert-base-t2t-multi · model · 182 dl · ♡ 3
- 🤗 AIRI-Institute/gena-lm-bert-large-t2t · model · 1.1k dl · ♡ 9
- 🤗 AIRI-Institute/gena-lm-bigbird-base-sparse · model · 149 dl · ♡ 3
- 🤗 AIRI-Institute/gena-lm-bigbird-base-sparse-t2t · model · 131 dl · ♡ 5
- 🤗 AIRI-Institute/gena-lm-bert-base-fly · model · 19 dl
- 🤗 AIRI-Institute/gena-lm-bert-base-yeast · model · 31 dl · ♡ 1
- 🤗 AIRI-Institute/gena-lm-bert-base-athaliana · model · 87 dl
Taxonomy
Topics: Power Transformer Diagnostics and Insulation · Magnetic Properties and Applications · Power Quality and Harmonics
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax
