L2 Regularization versus Batch and Weight Normalization

Twan van Laarhoven

arXiv:1706.05350 · cs.LG · June 19, 2017 · 210 cites

TL;DR

This paper examines how L2 regularization interacts with normalization techniques such as Batch Normalization in deep neural networks. It shows that, under normalization, L2 regularization does not regularize; instead it changes the scale of the weights and, through that, the effective learning rate.

Contribution

It shows that L2 regularization has no regularizing effect when combined with normalization, and analyzes, both theoretically and experimentally, how it instead influences the scale of the weights and the effective learning rate.

Findings

01. L2 regularization has no regularizing effect when combined with normalization.

02. L2 regularization instead influences the scale of the weights, and thereby the effective learning rate.

03. Optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate.
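The first two findings follow from a short scale-invariance argument, sketched below under simplifying assumptions that are not spelled out on this page: a single normalized layer with weight vector $w$, a loss $L$ that depends on $w$ only through the normalized output, plain SGD with learning rate $\eta$, and no $\epsilon$ term in the normalization. The symbols $\alpha$, $\mu$, $\sigma$, and $\eta_{\text{eff}}$ are introduced here for illustration.

```latex
% Normalization removes the scale of w: for any \alpha > 0,
\frac{\alpha w^\top x - \mu(\alpha w^\top x)}{\sigma(\alpha w^\top x)}
  = \frac{w^\top x - \mu(w^\top x)}{\sigma(w^\top x)}
  \quad\Longrightarrow\quad L(\alpha w) = L(w).

% Differentiating L(\alpha w) = L(w) with respect to w:
\alpha \, \nabla L(\alpha w) = \nabla L(w)
  \quad\Longrightarrow\quad
  \nabla L(\alpha w) = \tfrac{1}{\alpha} \nabla L(w).

% One SGD step taken from the rescaled weights \alpha w:
\alpha w - \eta \, \nabla L(\alpha w)
  = \alpha \left( w - \frac{\eta}{\alpha^{2}} \nabla L(w) \right),
  \qquad\text{so}\qquad
  \eta_{\text{eff}} \propto \frac{\eta}{\lVert w \rVert_2^{2}}.

% The L2 penalty, by contrast, does depend on the scale:
\lambda \lVert \alpha w \rVert_2^{2} = \alpha^{2} \lambda \lVert w \rVert_2^{2}.
```

Shrinking $w$ therefore lowers the penalty without changing the function the network computes, which is why the penalty cannot act as a regularizer here, while the same shrinkage raises the effective learning rate.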

Abstract

Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
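The scale dependence described in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the paper's experimental setup: it assumes a single linear unit followed by batch normalization (without the learnable scale and shift, and with no epsilon term), a squared-error loss, and random data. All function and variable names are hypothetical.

```python
import numpy as np

def normalized_output(w, X):
    """Batch-normalize the pre-activation X @ w over the batch dimension."""
    z = X @ w
    return (z - z.mean()) / np.sqrt(z.var())

def loss(w, X, t):
    """Mean squared error between the normalized output and targets t."""
    return np.mean((normalized_output(w, X) - t) ** 2)

def numerical_grad(f, w, h=1e-6):
    """Central-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))   # batch of 64 inputs with 5 features (random, illustrative)
t = rng.normal(size=64)        # random targets
w = rng.normal(size=5)         # weights of the single unit
alpha = 10.0                   # factor by which the weights are rescaled

f = lambda v: loss(v, X, t)

# The loss ignores the scale of the weights...
loss_w, loss_scaled = f(w), f(alpha * w)

# ...the gradient shrinks by 1/alpha when the weights grow by alpha...
grad_w, grad_scaled = numerical_grad(f, w), numerical_grad(f, alpha * w)

# ...while the L2 penalty term grows by alpha**2.
l2_w, l2_scaled = np.sum(w ** 2), np.sum((alpha * w) ** 2)

print("loss:", loss_w, loss_scaled)
print("gradient norm:", np.linalg.norm(grad_w), np.linalg.norm(grad_scaled))
print("L2 penalty:", l2_w, l2_scaled)
```

In this sketch the two loss values agree, so a penalty on the weight norm does not constrain the function being computed. The gradient at the rescaled weights is smaller by the factor `alpha`, which is the mechanism by which weight scale acts like a change in learning rate.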

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Machine Learning and Algorithms

Methods: Adam