Understanding and Improving Layer Normalization

Jingjing Xu; Xu Sun; Zhiyuan Zhang; Guangxiang Zhao; Junyang Lin

arXiv:1911.07013 · cs.LG · November 19, 2019 · 177 citations

TL;DR

This paper argues that the effectiveness of Layer Normalization comes more from the derivatives of the mean and variance, which re-center and re-scale backward gradients, than from forward normalization. It also introduces AdaNorm, a normalization method that improves performance and reduces overfitting.

Contribution

It shows that the derivatives of the mean and variance play a critical role in LayerNorm's effectiveness, and proposes AdaNorm, a normalization technique that outperforms LayerNorm on seven of eight datasets.

Findings

01. The derivatives of the mean and variance, which re-center and re-scale backward gradients, matter more than forward normalization.

02. Removing the bias and gain (LayerNorm-simple) outperforms LayerNorm on four datasets.

03. AdaNorm outperforms LayerNorm on seven of eight datasets.
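The first finding can be illustrated with a small sketch of the backward pass through a parameter-free layer normalization. This is not the authors' code: the function names, the `eps` stabilizer, and the sample vectors are assumptions made for illustration, and the gradient formula is the standard chain-rule result for y = (x - mean) / std. The term that subtracts the mean of the incoming gradient arises from differentiating the mean, and the term scaled by `y` arises from differentiating the variance.

```python
import math


def layer_norm_simple(x, eps=1e-5):
    """Forward pass: normalize a list of floats to zero mean, unit variance.

    eps is an assumed numerical stabilizer, not taken from the paper.
    Returns the normalized vector and 1/std for reuse in the backward pass.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x], inv_std


def layer_norm_simple_backward(x, grad_y, eps=1e-5):
    """Backward pass: gradient of the loss w.r.t. x, given grad_y = dLoss/dy."""
    y, inv_std = layer_norm_simple(x, eps)
    n = len(x)
    # Contribution of d(mean)/dx: subtracts the average incoming gradient.
    mean_g = sum(grad_y) / n
    # Contribution of d(variance)/dx: removes the component along y.
    mean_gy = sum(g * v for g, v in zip(grad_y, y)) / n
    return [inv_std * (g - mean_g - v * mean_gy) for g, v in zip(grad_y, y)]


# Hypothetical input and upstream gradient, chosen only for illustration.
x = [0.5, -1.0, 2.0, 3.5]
grad_y = [1.0, 2.0, -0.5, 4.0]
grad_x = layer_norm_simple_backward(x, grad_y)

# The gradient that reaches x sums to (numerically) zero: it is re-centered.
print(sum(grad_x))
```

Whatever the upstream gradient looks like, the gradient that leaves the layer has zero mean across the hidden dimension, which is one concrete sense in which the backward pass re-centers gradients.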

Abstract

Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where its effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies attribute the success of LayerNorm to forward normalization. In contrast, we find that the derivatives of the mean and variance are more important than forward normalization, because they re-center and re-scale backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not help in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains…
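As a rough illustration of the two variants the abstract contrasts, the sketch below implements layer normalization over a single hidden vector with optional gain and bias. It is a minimal sketch rather than the authors' implementation: the function name `layer_norm`, the `eps` stabilizer, and the list-of-floats representation are assumptions. Calling it without `gain` and `bias` corresponds to what the abstract calls LayerNorm-simple.

```python
import math


def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize one hidden vector (a list of floats) across its elements.

    gain and bias are optional per-element parameters. Leaving both as None
    gives the parameter-free variant (LayerNorm-simple in the abstract).
    eps is an assumed stabilizer to avoid division by zero.
    """
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    y = [(v - mean) * inv_std for v in x]
    if gain is not None:
        y = [g * v for g, v in zip(gain, y)]  # element-wise re-scaling
    if bias is not None:
        y = [v + b for v, b in zip(y, bias)]  # element-wise shift
    return y


# Hypothetical hidden vector, chosen only for illustration.
hidden = [1.0, 2.0, 3.0, 4.0]
simple = layer_norm(hidden)  # no learnable parameters
full = layer_norm(hidden, gain=[2.0] * 4, bias=[1.0] * 4)  # with gain and bias
```

The only difference between the two calls is the pair of learnable vectors, which are the parameters the abstract reports as increasing the risk of over-fitting.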

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Speech Recognition and Synthesis