How Neural Networks Learn the Support is an Implicit Regularization Effect of SGD
Pierfrancesco Beneventano, Andrea Pinto, Tomaso Poggio

TL;DR
This paper demonstrates that mini-batch SGD implicitly regularizes neural networks to learn the support of the target function by shrinking the first-layer weights attached to irrelevant input components, whereas vanilla GD needs an explicit regularization term to do the same.
Contribution
The paper identifies a second-order implicit regularization effect of mini-batch SGD, proportional to (step size / batch size), that enhances feature interpretability and reduces dependence on initialization.
Findings
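Mini-batch SGD learns the support in the first layer by shrinking the weights of irrelevant input components to zero.
Vanilla GD also approximates the target function but requires explicit regularization to learn the support.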
Smaller batch sizes improve feature interpretability.
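The findings above can be probed with a toy experiment. The following is a minimal sketch, not the authors' setup: the two-layer tanh network, the sine target, the function name `train`, and every hyperparameter are assumptions chosen for illustration. It trains the same network from the same initialization with full-batch GD and with small-batch SGD, then reports the norm of the first-layer weights attached to relevant and irrelevant input coordinates.

```python
import numpy as np

def train(batch_size, steps=2000, lr=0.05, d=10, k=2, n=256, hidden=16, seed=0):
    """Train a two-layer tanh network on a target that uses only the first k inputs.

    Hypothetical toy setup; returns (relevant_norm, irrelevant_norm) of the
    first-layer weight columns after training.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = np.sin(X[:, :k].sum(axis=1))            # target ignores coordinates k..d-1
    W1 = rng.standard_normal((hidden, d)) / np.sqrt(d)
    w2 = rng.standard_normal(hidden) / np.sqrt(hidden)

    for _ in range(steps):
        # batch_size == n gives full-batch GD; smaller values give mini-batch SGD
        idx = rng.choice(n, size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]

        H = np.tanh(Xb @ W1.T)                  # hidden activations, shape (b, hidden)
        err = H @ w2 - yb                       # residuals of 0.5 * mean squared error

        grad_w2 = H.T @ err / batch_size
        grad_pre = np.outer(err, w2) * (1.0 - H**2)   # backprop through tanh
        grad_W1 = grad_pre.T @ Xb / batch_size

        W1 -= lr * grad_W1
        w2 -= lr * grad_w2

    relevant_norm = np.linalg.norm(W1[:, :k])
    irrelevant_norm = np.linalg.norm(W1[:, k:])
    return relevant_norm, irrelevant_norm

# Compare full-batch GD (batch = n) against small-batch SGD on identical data and init.
for b in (256, 8):
    rel, irr = train(batch_size=b)
    print(f"batch={b:3d}  relevant={rel:.3f}  irrelevant={irr:.3f}")
```

Since the effect described in the paper scales with (step size / batch size), a run this short may show only a small gap between the two settings; increasing `steps` or `lr`, or decreasing `batch_size`, is the natural way to explore it.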
Abstract
We investigate the ability of deep neural networks to identify the support of the target function. Our findings reveal that mini-batch SGD effectively learns the support in the first layer of the network by shrinking to zero the weights associated with irrelevant components of the input. In contrast, we demonstrate that while vanilla GD also approximates the target function, it requires an explicit regularization term to learn the support in the first layer. We prove that this property of mini-batch SGD is due to a second-order implicit regularization effect that is proportional to (step size / batch size). Our results not only provide further evidence that implicit regularization has a significant impact on training optimization dynamics, but also shed light on the structure of the features that are learned by the network. Additionally, they suggest that smaller batches enhance…
Taxonomy
Topics: Medical Imaging and Analysis
Methods: Stochastic Gradient Descent
