On the Implicit Bias Towards Minimal Depth of Deep Neural Networks

Tomer Galanti; Liane Galanti; Ido Ben-Shaul

arXiv:2202.09028 · cs.LG · September 29, 2022 · 1 citation

TL;DR

This paper studies the implicit bias of stochastic gradient descent (SGD) toward solutions of small effective depth when training deep neural networks, and relates effective depth to neural collapse, generalization, and label corruption.

Contribution

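The paper introduces effective depth, defined as the first layer at which sample embeddings are separable by the nearest-class-center classifier; shows empirically that SGD selects networks of small effective depth; and relates the degree of separability in intermediate layers to generalization.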

Findings

1. SGD implicitly favors neural networks with small effective depths.

2. Neural collapse occurs even when generalization is not possible.

3. Effective depth increases with the number of random labels in the data.
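
The random-label setting in the third finding can be illustrated with a small helper. This is a minimal sketch, not the authors' procedure: the function name `corrupt_labels`, the uniform resampling scheme, and the `fraction` and `seed` parameters are assumptions made for illustration.

```python
import numpy as np

def corrupt_labels(labels, fraction, num_classes, seed=0):
    """Replace a given fraction of labels with uniformly random classes.

    Hypothetical helper: the paper's exact corruption protocol may differ.
    Returns the corrupted copy and the indices that were resampled.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()  # leave the caller's array untouched
    num_corrupt = int(round(fraction * len(noisy)))
    # Choose distinct positions to corrupt.
    idx = rng.choice(len(noisy), size=num_corrupt, replace=False)
    # A resampled label may coincide with the original one by chance.
    noisy[idx] = rng.integers(0, num_classes, size=num_corrupt)
    return noisy, idx
```

Training the same architecture on copies of the data corrupted at increasing values of `fraction` gives the kind of sweep over which effective depth could then be compared.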

Abstract

Recent results in the literature suggest that the penultimate (second-to-last) layer representations of neural networks that are trained for classification exhibit a clustering property called neural collapse (NC). We study the implicit bias of stochastic gradient descent (SGD) in favor of low-depth solutions when training deep neural networks. We characterize a notion of effective depth that measures the first layer for which sample embeddings are separable using the nearest-class center classifier. Furthermore, we hypothesize and empirically show that SGD implicitly selects neural networks of small effective depths. Secondly, while neural collapse emerges even when generalization should be impossible, we argue that the degree of separability in the intermediate layers is related to generalization. We derive a generalization bound based on comparing the effective depth of the…
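
The notion of effective depth described in the abstract can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names `ncc_error` and `effective_depth`, the use of training-set misclassification as the separability criterion, and the tolerance `epsilon` are all assumptions, and the per-layer embeddings are assumed to have been extracted already as arrays of shape `(num_samples, dim)`.

```python
import numpy as np

def ncc_error(features, labels):
    """Misclassification rate of the nearest-class-center classifier on one layer."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One mean embedding (class center) per class, shape (num_classes, dim).
    centers = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from every sample to every class center.
    dists = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    predictions = classes[dists.argmin(axis=1)]
    return float((predictions != labels).mean())

def effective_depth(layer_features, labels, epsilon=0.0):
    """1-based index of the first layer whose NCC error is at most epsilon.

    Returns None if no layer reaches the tolerance (an assumption of this sketch).
    """
    for depth, features in enumerate(layer_features, start=1):
        if ncc_error(features, labels) <= epsilon:
            return depth
    return None

# Toy usage: the first "layer" mixes the classes, the second separates them.
labels = np.array([0, 0, 1, 1])
layer1 = np.array([[0.0], [2.0], [1.0], [3.0]])
layer2 = np.array([[0.0], [0.2], [1.0], [1.2]])
print(effective_depth([layer1, layer2], labels))
```

With a nonzero `epsilon`, a layer counts as separable once its nearest-class-center error drops below that tolerance, which is a softer criterion than exact separability.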

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning

Methods: Stochastic Gradient Descent