Weight Rescaling: Effective and Robust Regularization for Deep Neural Networks with Batch Normalization
Ziquan Liu, Yufei Cui, Jia Wan, Yu Mao, Antoni B. Chan

TL;DR
This paper introduces Weight Rescaling (WRS), a simple regularization method for deep neural networks with batch normalization. WRS addresses problems with weight decay by controlling weight norms, and it improves generalization across several vision tasks while being less sensitive to hyperparameter choices.
Contribution
The paper proposes WRS, a novel weight normalization scheme that outperforms traditional weight decay and other methods in terms of robustness and effectiveness.
Findings
WRS improves generalization across multiple vision tasks.
WRS is more robust to hyperparameter choices than weight decay.
WRS outperforms weight decay, weight standardization, and AdamP in experiments.
Abstract
Weight decay is often used to ensure good generalization when training deep neural networks with batch normalization (BN-DNNs), where some convolution layers are invariant to weight rescaling due to the normalization. In this paper, we demonstrate that the practical usage of weight decay still has some unsolved problems, in spite of existing theoretical work explaining the effect of weight decay in BN-DNNs. On the one hand, when a non-adaptive learning rate optimizer, e.g., SGD with momentum (SGDM), is used, the effective learning rate continues to increase even after the initial training stage, which leads to an overfitting effect in many neural architectures. On the other hand, with both SGDM and adaptive learning rate optimizers, e.g., Adam, the effect of weight decay on generalization is quite sensitive to the hyperparameter. Thus, finding an optimal weight decay parameter requires…
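The scale invariance mentioned in the abstract can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: a linear layer stands in for a convolution, batch normalization is written without learned affine parameters and with eps set to 0 so the invariance is exact, and the helper names (batch_norm, loss, numerical_grad) are hypothetical. It shows that scaling the weights by a constant alpha leaves the normalized output unchanged while the gradient shrinks by 1/alpha, which is why the weight norm acts on the effective learning rate. The last lines rescale the weights to unit norm as one illustrative way of "controlling weight norms"; this summary does not spell out the exact WRS procedure, so treat that step as an assumption.

```python
import numpy as np

def batch_norm(z, eps=0.0):
    # Normalize each feature over the batch dimension (no learned affine, for clarity).
    mean = z.mean(axis=0, keepdims=True)
    var = z.var(axis=0, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

def loss(w, x, y):
    # Linear layer followed by batch normalization, then a squared-error loss.
    out = batch_norm(x @ w)
    return 0.5 * np.mean((out - y) ** 2)

def numerical_grad(f, w, h=1e-6):
    # Central finite differences, one weight entry at a time.
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        step = np.zeros_like(w)
        step[idx] = h
        g[idx] = (f(w + step) - f(w - step)) / (2 * h)
    return g

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 5))   # batch of 32 inputs with 5 features
y = rng.normal(size=(32, 3))   # arbitrary regression targets
w = rng.normal(size=(5, 3))    # weights of the layer feeding into BN
alpha = 4.0                    # arbitrary positive rescaling factor

# 1) The normalized output does not depend on the scale of w.
out_ref = batch_norm(x @ w)
out_scaled = batch_norm(x @ (alpha * w))
outputs_match = np.allclose(out_ref, out_scaled)

# 2) The gradient at alpha * w is the gradient at w divided by alpha.
g_ref = numerical_grad(lambda v: loss(v, x, y), w)
g_scaled = numerical_grad(lambda v: loss(v, x, y), alpha * w)
grads_shrink = np.allclose(g_scaled, g_ref / alpha, atol=1e-6)

# 3) Illustrative assumption: rescale w to unit norm; the function is unchanged.
w_unit = w / np.linalg.norm(w)
unit_norm_same_function = np.allclose(batch_norm(x @ w_unit), out_ref)

print(outputs_match, grads_shrink, unit_norm_same_function)
```

Because the gradient scales inversely with the weight norm, a plain SGD step with a fixed learning rate moves the weight direction more when the norm is small. Shrinking the norm through weight decay therefore tends to raise the effective learning rate, which is consistent with the behavior the abstract describes for SGDM.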
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and ELM
Methods: Weight Decay · Batch Normalization · Stochastic Gradient Descent · Convolution
