L2 Regularization versus Batch and Weight Normalization
Twan van Laarhoven

TL;DR
This paper investigates how L2 regularization interacts with normalization techniques such as Batch Normalization in deep neural networks. It shows that, in normalized networks, L2 regularization does not act as a regularizer; instead it changes the scale of the weights and, through that, the effective learning rate.
Contribution
The paper shows that L2 regularization does not regularize when combined with normalization, and analyzes, both theoretically and experimentally, how it instead influences the weight scale and the effective learning rate.
Findings
L2 regularization has no regularizing effect when combined with normalization.
Instead, L2 regularization influences the scale of the weights and thereby the effective learning rate.
Optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate.
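The first two findings can be checked numerically on a toy case. The following is a minimal sketch, not the paper's experimental setup: it assumes a single linear unit followed by batch normalization without learned scale or shift, a squared-error loss, and random data. The names `batchnorm_linear`, `loss`, `numerical_grad`, and the scale factor `alpha` are hypothetical and introduced only for illustration.

```python
import numpy as np


def batchnorm_linear(w, X):
    """Linear unit followed by batch normalization (no learned scale/shift).

    Assumption for this sketch: no epsilon in the denominator, so the
    output is exactly invariant to rescaling w by a positive factor.
    """
    z = X @ w
    return (z - z.mean()) / z.std()


def loss(w, X, t):
    """Mean squared error between the normalized output and targets t."""
    y = batchnorm_linear(w, X)
    return 0.5 * np.mean((y - t) ** 2)


def numerical_grad(f, w, h=1e-6):
    """Central-difference gradient of a scalar function f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g


rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))   # toy batch: 64 samples, 5 features
t = rng.normal(size=64)        # toy targets
w = rng.normal(size=5)         # weights of the linear unit
alpha = 10.0                   # arbitrary positive rescaling of the weights

# 1. The normalized output does not change when the weights are rescaled.
out_same = np.allclose(batchnorm_linear(w, X), batchnorm_linear(alpha * w, X))

# 2. The gradient shrinks by a factor alpha when the weights grow by alpha.
g = numerical_grad(lambda v: loss(v, X, t), w)
g_scaled = numerical_grad(lambda v: loss(v, X, t), alpha * w)
grad_scales_inversely = np.allclose(g_scaled, g / alpha, rtol=1e-4, atol=1e-7)

# 3. The L2 penalty, in contrast, grows by alpha**2 for the same function.
l2_ratio = np.sum((alpha * w) ** 2) / np.sum(w ** 2)

print("output invariant to weight scale:", out_same)
print("gradient scales as 1/alpha:", grad_scales_inversely)
print("L2 penalty ratio:", l2_ratio)
```

Under these assumptions the network function is the same at `w` and `alpha * w`, while the L2 penalty differs by `alpha**2`. The penalty therefore cannot constrain which function is learned; it only changes the weight scale, and with it the size of the gradient relative to the weights.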
Abstract
Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
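The dependence of the effective learning rate on the weight scale can be made explicit with a short derivation. This is a sketch under simplifying assumptions: a single weight vector $w$ feeding a normalization layer with output $y(w)$, a loss $L$ that depends on $w$ only through $y$, and plain SGD with learning rate $\eta$. The symbols $\alpha$ and $\eta_{\text{eff}}$ are notation chosen here for illustration.

```latex
\begin{align*}
y(\alpha w) &= y(w) \quad \text{for all } \alpha > 0
  && \text{(normalization removes the scale)} \\
(\nabla L)(\alpha w) &= \frac{1}{\alpha}\, (\nabla L)(w)
  && \text{(chain rule applied to } L(\alpha w) = L(w)\text{)} \\
\alpha w - \eta\, (\nabla L)(\alpha w)
  &= \alpha \left( w - \frac{\eta}{\alpha^{2}}\, (\nabla L)(w) \right)
  && \text{(one SGD step on the rescaled weights)} \\
\eta_{\text{eff}} &= \frac{\eta}{\lVert w \rVert_2^{2}}
  && \text{(taking } w \text{ of unit norm, so that } \alpha = \lVert \alpha w \rVert_2\text{)}
\end{align*}
```

Read this way, an SGD step on weights of norm $\alpha$ moves the underlying function as much as a step with learning rate $\eta / \alpha^{2}$ on unit-norm weights. An L2 penalty shrinks $\lVert w \rVert_2$ and so raises $\eta_{\text{eff}}$ without restricting the function being learned. For optimizers such as ADAM that rescale the gradient per parameter, the step size is roughly independent of the gradient's magnitude, so under the same assumptions one would expect a weaker but still present dependence on $\lVert w \rVert_2$, consistent with the abstract's statement that the influence is only partially eliminated.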
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms
Methods: Adam
