Understanding and Improving Layer Normalization
Jingjing Xu, Xu Sun, Zhiyuan Zhang, Guangxiang Zhao, Junyang Lin

TL;DR
This paper argues that the benefit of Layer Normalization comes mainly from the derivatives of the mean and variance, which re-center and re-scale backward gradients, rather than from forward normalization. It also introduces AdaNorm, a normalization method reported to improve performance and reduce overfitting.
Contribution
The paper shows that derivatives play a critical role in LayerNorm's effectiveness and proposes AdaNorm, a normalization technique that outperforms LayerNorm on seven of eight datasets.
Findings
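The derivatives of the mean and variance, which re-center and re-scale backward gradients, matter more than forward normalization.
The bias and gain increase the risk of over-fitting; LayerNorm without them (LayerNorm-simple) outperforms LayerNorm on four datasets.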
AdaNorm outperforms LayerNorm on seven of eight datasets.
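To make the bias-and-gain finding concrete, here is a minimal NumPy sketch of the forward pass. It is an illustration, not the authors' code: the function name `layer_norm`, the `eps` default, and the tensor shapes are assumptions. Calling it without `gain` and `bias` corresponds to the LayerNorm-simple variant described in the abstract.

```python
import numpy as np

def layer_norm(x, gain=None, bias=None, eps=1e-5):
    """Normalize the last axis of x; gain and bias are optional.

    Hypothetical helper for illustration. With gain=None and bias=None
    this is the parameter-free variant (LayerNorm-simple).
    """
    mu = x.mean(axis=-1, keepdims=True)                    # per-example mean
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)   # per-example std
    y = (x - mu) / sigma                                   # forward normalization
    if gain is not None:
        y = y * gain                                       # learned re-scaling
    if bias is not None:
        y = y + bias                                       # learned re-centering
    return y

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=(4, 16))           # assumed shape: (batch, hidden)

y_simple = layer_norm(x)                                   # LayerNorm-simple
y_full = layer_norm(x, gain=np.ones(16), bias=np.zeros(16))  # LayerNorm at its usual initialization
```

With the gain at one and the bias at zero the two variants coincide, so any difference between them in training comes only from the extra learned parameters.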
Abstract
Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where the effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains…
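The re-centering of backward gradients mentioned in the abstract can be checked numerically. The sketch below is an assumption-laden illustration rather than the paper's experiment: it uses the standard closed-form input gradient of parameter-free layer normalization, and the names `normalize` and `normalize_backward` are hypothetical. It shows only the re-centering effect, namely that the input gradient has zero mean even when the upstream gradient does not.

```python
import numpy as np

def normalize(x, eps=1e-5):
    """Parameter-free layer normalization over the last axis."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma

def normalize_backward(x, grad_y, eps=1e-5):
    """Gradient of the loss w.r.t. x, given grad_y = dL/dy.

    The terms g_mean and y * gy_mean arise from differentiating through
    the mean and the variance, respectively.
    """
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    y = (x - mu) / sigma
    g_mean = grad_y.mean(axis=-1, keepdims=True)         # from d(mean)/dx
    gy_mean = (grad_y * y).mean(axis=-1, keepdims=True)  # from d(variance)/dx
    return (grad_y - g_mean - y * gy_mean) / sigma

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))
grad_y = rng.normal(loc=3.0, size=(4, 16))   # upstream gradient with non-zero mean

grad_x = normalize_backward(x, grad_y)
row_means = grad_x.mean(axis=-1)             # close to zero for every row
```

If the mean and variance were treated as constants in the backward pass, the input gradient would simply be `grad_y / sigma` and would keep the non-zero mean of the upstream gradient.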
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Speech Recognition and Synthesis
