On the Implicit Bias Towards Minimal Depth of Deep Neural Networks
Tomer Galanti, Liane Galanti, Ido Ben-Shaul

TL;DR
This paper investigates how stochastic gradient descent implicitly favors neural networks of small effective depth, and links this effective depth measure to neural collapse, generalization, and label corruption.
Contribution
It introduces the notion of effective depth, demonstrates SGD's bias towards low-depth solutions, and connects intermediate layer separability with generalization performance.
Findings
SGD implicitly favors neural networks with small effective depths.
Neural collapse occurs even when generalization is not possible.
Effective depth increases with the number of randomly labeled samples in the training data.
Abstract
Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the first layer for which sample embeddings are separable using the nearest-class center classifier. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Secondly, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the…
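The effective depth described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `ncc_error` and `effective_depth`, the tolerance `eps`, the use of squared Euclidean distance, and the toy data are all assumptions made for illustration. The sketch takes one feature matrix per layer, applies a nearest class-center classifier to each, and reports the first layer whose training error falls at or below the tolerance.

```python
import numpy as np

def ncc_error(features, labels):
    """Fraction of samples misclassified by the nearest class-center rule."""
    classes = np.unique(labels)
    # Class centers: mean embedding of each class, shape (num_classes, dim).
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every sample to every class center.
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions != labels).mean())

def effective_depth(layer_features, labels, eps=0.0):
    """1-based index of the first layer whose NCC error is <= eps.

    `eps` is a hypothetical tolerance; returns None if no layer qualifies.
    """
    for depth, features in enumerate(layer_features, start=1):
        if ncc_error(features, labels) <= eps:
            return depth
    return None

# Toy data (assumed): layer 1 mixes the two classes, layer 2 separates them.
labels = np.array([0, 0, 1, 1])
layer1 = np.array([[0.0], [1.0], [0.0], [1.0]])
layer2 = np.array([[0.0], [0.1], [1.0], [1.1]])

print(effective_depth([layer1, layer2], labels))
```

In practice the per-layer feature matrices would come from the intermediate activations of a trained network, and a small positive `eps` would likely be needed because exact separability is rare on real data.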
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning
Methods: Stochastic Gradient Descent
