On Layer Normalization in the Transformer Architecture

Ruibin Xiong; Yunchang Yang; Di He; Kai Zheng; Shuxin Zheng; Chen Xing; Huishuai Zhang; Yanyan Lan; Liwei Wang; Tie-Yan Liu

arXiv:2002.04745 · cs.LG · June 30, 2020 · 110 cites

TL;DR

This paper analyzes how the placement of layer normalization in Transformers affects training stability and shows that Pre-LN Transformers can be trained without a learning rate warm-up stage, reducing training time and hyper-parameter tuning.

Contribution

The paper provides a theoretical explanation for the effectiveness of layer normalization placement and demonstrates that Pre-LN Transformers can be trained without warm-up, unlike Post-LN models.

Findings

01

Pre-LN Transformers have well-behaved gradients at initialization.

02

Removing warm-up stages does not harm performance for Pre-LN models.

03

Pre-LN models train faster and require less hyper-parameter tuning.
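
The warm-up stage referred to in the findings above is a schedule that ramps the learning rate up from a small value over the first training steps. A minimal sketch of one common form, a linear ramp, is given here for orientation. The function name `warmup_lr`, its parameters, and the choice to hold the rate constant after the ramp are assumptions made for illustration; this page does not state the exact schedule used in the paper, and practical schedules usually decay the rate after warm-up.

```python
def warmup_lr(step, max_lr, warmup_steps):
    """Linear learning rate warm-up (illustrative sketch, not the paper's exact schedule).

    The rate grows linearly from 0 to max_lr over warmup_steps steps.
    Assumption: the rate is held at max_lr afterwards; real schedules
    typically apply some decay after the ramp.
    """
    if warmup_steps <= 0:
        # No warm-up stage: use the full learning rate from the first step.
        return max_lr
    return max_lr * min(1.0, step / warmup_steps)
```

Setting `warmup_steps` to 0 in this sketch corresponds to training without a warm-up stage, which the findings report is possible for Pre-LN models.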

Abstract

The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down the optimization and requires more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also…
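
The difference between the two placements discussed in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch only: the helper names (`layer_norm`, `post_ln_block`, `pre_ln_block`, `toy_sublayer`) are hypothetical, the layer normalization omits the learnable gain and bias, and the toy sub-layer stands in for the attention or feed-forward sub-layer of a real Transformer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learnable gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual addition, i.e. between residual blocks."""
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    """Pre-LN: normalize the sub-layer input; the residual path is left untouched."""
    return x + sublayer(layer_norm(x))

def toy_sublayer(weight):
    """Hypothetical stand-in for an attention or feed-forward sub-layer."""
    return lambda h: np.maximum(h @ weight, 0.0)

# Usage: apply both block variants to the same random input.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                         # 4 positions, hidden size 8
f = toy_sublayer(rng.normal(size=(8, 8)) / np.sqrt(8))
y_post = post_ln_block(x, f)                        # every row is re-normalized
y_pre = pre_ln_block(x, f)                          # x passes through the skip path unchanged
```

In the Post-LN variant every block output is re-normalized, whereas in the Pre-LN variant the input reaches the output through an identity path and only the sub-layer input is normalized.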

Code & Models

Videos

On Layer Normalization in the Transformer Architecture · SlidesLive

Taxonomy

Topics: Power Transformer Diagnostics and Insulation · Magnetic Properties and Applications · Power Quality and Harmonics

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Byte Pair Encoding · Dense Connections · Label Smoothing · Adam · Softmax