On the Generalization Mystery in Deep Learning

Satrajit Chatterjee; Piotr Zielinski

arXiv:2203.10036 · cs.LG · June 7, 2022 · 23 cites

Open Access

TL;DR

This paper argues that the coherence (alignment) of per-example gradients during training explains why over-parameterized neural networks generalize well. It offers a metric for that coherence that helps predict generalization and sheds light on training dynamics.

Contribution

It introduces a new, interpretable metric for gradient coherence and uses coherence to account for generalization, early learning, and noise robustness in deep neural networks.

Findings

01. Gradient coherence differs significantly between real and random datasets.

02. The metric predicts which solutions will generalize well.

03. Modifications to gradient descent can improve generalization.

Abstract

The generalization mystery in deep learning is the following: Why do over-parameterized neural networks trained with gradient descent (GD) generalize well on real datasets even though they are capable of fitting random datasets of comparable size? Furthermore, from among all solutions that fit the training data, how does GD find one that generalizes well (when such a well-generalizing solution exists)? We argue that the answer to both questions lies in the interaction of the gradients of different examples during training. Intuitively, if the per-example gradients are well-aligned, that is, if they are coherent, then one may expect GD to be (algorithmically) stable, and hence generalize well. We formalize this argument with an easy to compute and interpretable metric for coherence, and show that the metric takes on very different values on real and random datasets for several common…

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Stochastic Gradient Optimization Techniques · Machine Learning and Data Classification

Methods: Early Stopping