Implicit Regularization and Convergence for Weight Normalization
Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu and, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu

TL;DR
This paper analyzes weight normalization and a variant called reparametrized projected gradient descent, showing they act as implicit regularizers that guide convergence towards minimum norm solutions in overparametrized least-squares regression, with less sensitivity to initialization.
Contribution
It demonstrates that weight normalization and rPGD implicitly regularize and converge near minimum norm solutions, differing from standard gradient descent behavior.
Findings
Weight normalization and rPGD regularize weights adaptively.
These methods converge close to minimum norm solutions.
They are less sensitive to initializations compared to gradient descent.
Abstract
Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimansand Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Baet al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights with a scale g and a unit vector w and thus the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum l2 norm solution, even for initializations far from zero. For certain stepsizes of g and w , we show that they can converge close to the…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsSparse and Compressive Sensing Techniques · Stochastic Gradient Optimization Techniques · Face and Expression Recognition
MethodsLayer Normalization · Weight Normalization · Batch Normalization
