Inductive Bias of Gradient Descent for Weight Normalized Smooth Homogeneous Neural Nets
Depen Morwani, Harish G. Ramaswamy

TL;DR
This paper analyzes how gradient descent behaves on weight-normalized smooth homogeneous neural networks. It contrasts standard weight normalization (SWN) with exponential weight normalization (EWN) and shows that EWN tends toward sparse solutions, which is useful for pruning.
Contribution
The paper provides a theoretical analysis of the inductive bias of gradient descent under weight normalization, with a focus on EWN, and establishes convergence rates and a tendency toward asymptotic relative sparsity.
Findings
The EWN gradient flow path is equivalent to gradient flow on standard networks with an adaptive learning rate
EWN promotes asymptotic relative sparsity in weights
Experimental results support sparse solutions with EWN even under SGD
Abstract
We analyze the inductive bias of gradient descent for weight normalized smooth homogeneous neural nets, when trained on exponential or cross-entropy loss. We analyze both standard weight normalization (SWN) and exponential weight normalization (EWN), and show that the gradient flow path with EWN is equivalent to gradient flow on standard networks with an adaptive learning rate. We extend these results to gradient descent, and establish asymptotic relations between weights and gradients for both SWN and EWN. We also show that EWN causes weights to be updated in a way that prefers asymptotic relative sparsity. For EWN, we provide a finite-time convergence rate of the loss with gradient flow and a tight asymptotic convergence rate with gradient descent. We demonstrate our results for SWN and EWN on synthetic datasets. Experimental results on simple datasets support our claim on sparse EWN…
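The equivalence between EWN gradient flow and an adaptive learning rate can be illustrated with a small numerical sketch. The abstract does not spell out the parameterizations, so the code assumes the common forms w = gamma * v / ||v|| for SWN and w = exp(alpha) * v / ||v|| for EWN, applied to a single weight vector. The function names and the choice of a unit-norm v are illustrative assumptions, not the authors' code. Under these assumptions, the chain rule gives an induced velocity of dw/dt = -||w||^2 * dL/dw for EWN, that is, plain gradient flow on w with a learning rate that scales with ||w||^2.

```python
import numpy as np


def swn_weight(gamma, v):
    """Assumed SWN form: w = gamma * v / ||v||."""
    return gamma * v / np.linalg.norm(v)


def ewn_weight(alpha, v):
    """Assumed EWN form: w = exp(alpha) * v / ||v||."""
    return np.exp(alpha) * v / np.linalg.norm(v)


def ewn_param_grads(alpha, v, grad_w):
    """Chain rule: gradients w.r.t. (alpha, v) given grad_w = dL/dw."""
    norm_v = np.linalg.norm(v)
    v_hat = v / norm_v
    scale = np.exp(alpha)
    grad_alpha = scale * (v_hat @ grad_w)
    # The Jacobian of v / ||v|| is (I - v_hat v_hat^T) / ||v||, which is symmetric.
    grad_v = (scale / norm_v) * (grad_w - (v_hat @ grad_w) * v_hat)
    return grad_alpha, grad_v


def ewn_induced_velocity(alpha, v, grad_w):
    """dw/dt induced by gradient flow on the EWN parameters (alpha, v)."""
    norm_v = np.linalg.norm(v)
    v_hat = v / norm_v
    scale = np.exp(alpha)
    grad_alpha, grad_v = ewn_param_grads(alpha, v, grad_w)
    d_alpha, d_v = -grad_alpha, -grad_v  # gradient flow moves against the gradient
    # dw = (dw/dalpha) * d_alpha + (dw/dv) @ d_v
    radial = scale * v_hat * d_alpha
    tangential = (scale / norm_v) * (d_v - (v_hat @ d_v) * v_hat)
    return radial + tangential


# Illustrative check with arbitrary values (hypothetical, not data from the paper).
rng = np.random.default_rng(0)
v = rng.normal(size=5)
v /= np.linalg.norm(v)        # unit-norm direction, an assumption of this sketch
alpha = 0.3
grad_w = rng.normal(size=5)   # stand-in for dL/dw at the current weights

w = ewn_weight(alpha, v)
velocity = ewn_induced_velocity(alpha, v, grad_w)
adaptive = -(np.linalg.norm(w) ** 2) * grad_w
print(np.allclose(velocity, adaptive))
```

The sketch covers a single weight vector only. In a network, each normalized weight vector carries its own norm, so the effective learning rate would differ per vector under these assumptions.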
Taxonomy
Topics: Machine Learning and ELM · Neural Networks and Applications · Stochastic Gradient Optimization Techniques
Methods: Weight Normalization · Stochastic Gradient Descent
