Understanding the Disharmony between Weight Normalization Family and Weight Decay: $\epsilon-$shifted $L_2$ Regularizer
Li Xiang, Chen Shuo, Xia Yan, Yang Jian

TL;DR
This paper investigates the interaction between weight normalization and weight decay, revealing theoretical insights and proposing an $\epsilon$-shifted $L_2$ regularizer to improve training stability and performance.
Contribution
It provides a theoretical analysis of weight decay in the context of weight normalization and introduces a novel $\epsilon$-shifted $L_2$ regularizer to address related training issues.
Findings
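When used together with the weight normalization family, weight decay only modulates the effective learning rate and has no influence on generalization.
Adding weight decay to a weight-normalized objective can leave the objective without a global minimum and make training unstable.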
The $\epsilon$-shifted $L_2$ regularizer guarantees global minimum existence and improves training stability.
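The first finding rests on scale invariance, which can be checked numerically. The sketch below is an illustration and not the paper's code: it assumes a single linear unit with the simplest form of weight normalization, $W' = W / \|W\|$ with no learnable gain, and a squared-error loss. The names `normalize`, `loss`, and `grad` are hypothetical helpers introduced here.

```python
import numpy as np

def normalize(w):
    """Weight normalization without a learnable gain: w' = w / ||w||."""
    return w / np.linalg.norm(w)

def loss(w, x, y):
    """Squared error of a linear unit that uses the normalized weight."""
    return 0.5 * (x @ normalize(w) - y) ** 2

def grad(w, x, y):
    """Analytic gradient of the loss with respect to the raw weight w."""
    n = np.linalg.norm(w)
    u = w / n
    g_u = (x @ u - y) * x              # gradient w.r.t. the normalized weight
    return (g_u - (g_u @ u) * u) / n   # drop the radial part, scale by 1/||w||

rng = np.random.default_rng(0)
w = rng.normal(size=5)
x = rng.normal(size=5)
y = 0.3
c = 10.0                               # arbitrary rescaling of the raw weight

# The loss ignores the magnitude of w ...
same_loss = np.isclose(loss(w, x, y), loss(c * w, x, y))
# ... while the gradient shrinks by the same factor that w grew by.
grad_ratio = np.linalg.norm(grad(w, x, y)) / np.linalg.norm(grad(c * w, x, y))
# The gradient has no component along w.
radial = grad(w, x, y) @ w
```

Under these assumptions `same_loss` is true, `grad_ratio` equals `c`, and `radial` is zero up to rounding. Because the gradient is orthogonal to the weight, a plain gradient step can only increase $\|W\|$, which in turn shrinks later steps; this is consistent with the under-fitting the abstract reports when the weight is not decayed.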
Abstract
The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization that changes the weight $W$ to $W'$, which makes $W'$ independent of the magnitude of $W$. Surprisingly, $W$ must be decayed during gradient descent, otherwise we will observe a severe under-fitting problem, which is very counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. In this paper, we theoretically prove that the weight decay term merely modulates the effective learning rate for improving objective optimization, and has no influence on generalization when the weight normalization family is compositely employed. Furthermore, we…
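The effective-learning-rate claim can be made concrete with a standard argument for scale-invariant weights. This is a sketch under simplifying assumptions and not necessarily the paper's own proof: it takes $W' = W/\|W\|$ with no learnable gain, plain gradient descent with step size $\eta$, and an $L_2$ weight decay coefficient $\lambda$. The symbols $g$, $\eta$, $\lambda$, and $P$ are introduced here for illustration.

$$
\nabla_{W}\mathcal{L} = \frac{1}{\|W\|}\,P\,g,
\qquad P = I - W'W'^{\top},
\qquad g = \nabla_{W'}\mathcal{L}
$$

$$
\Delta W' \approx \frac{1}{\|W\|}\,P\,\Delta W
= -\frac{\eta}{\|W\|^{2}}\,P\,g
$$

$$
\|W_{t+1}\|^{2} = (1-\eta\lambda)^{2}\,\|W_{t}\|^{2} + \eta^{2}\,\|\nabla_{W}\mathcal{L}\|^{2}
$$

The second line shows that, to first order, the normalized weight moves with step size $\eta/\|W\|^{2}$. The third line uses the orthogonality of $\nabla_{W}\mathcal{L}$ and $W$: with $\lambda = 0$ the norm can only grow, so the effective step size keeps shrinking, while $\lambda > 0$ counteracts that growth.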
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition
Methods: Weight Decay · Weight Normalization
