Why Deep Learning Generalizes
Benjamin L. Badger

TL;DR
This paper investigates why deep learning models tend to generalize well despite their capacity to memorize. It examines how added noise and dataset size affect memorization, and gives a theoretical account in which parameters are drawn toward points where the model is most stable with respect to its inputs.
Contribution
It introduces methods to train models to memorize datasets that normally generalize, and provides a theoretical explanation for the bias towards generalization based on stability during training.
Findings
Memorization is harder than generalization but easier with added noise.
Larger datasets increase overfitting for random data but reduce it for natural images.
Generalization is linked to a model's parameters being attracted, during gradient descent, to points of maximal stability with respect to the model's inputs.
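The capacity to fit pure noise can be illustrated with a minimal sketch. This is not the paper's experimental setup: the linear model, the dataset sizes, the step-size rule, and all variable names are assumptions chosen to keep the example small. It trains an overparameterized model by full-batch gradient descent on labels that are independent of the inputs, then compares accuracy on the training set with accuracy on fresh noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_features = 20, 200, 100  # more parameters than training samples

# Pure-noise dataset: Gaussian inputs with labels drawn independently of them.
X_train = rng.standard_normal((n_train, n_features))
y_train = rng.choice([-1.0, 1.0], size=n_train)
X_test = rng.standard_normal((n_test, n_features))
y_test = rng.choice([-1.0, 1.0], size=n_test)

# Full-batch gradient descent on mean squared error for a linear model.
w = np.zeros(n_features)
# Step size 1 / (largest Hessian eigenvalue), an assumed choice that keeps descent stable.
step = n_train / np.linalg.norm(X_train, 2) ** 2
for _ in range(2000):
    residual = X_train @ w - y_train
    w -= step * X_train.T @ residual / n_train

train_acc = float(np.mean(np.sign(X_train @ w) == y_train))
test_acc = float(np.mean(np.sign(X_test @ w) == y_test))
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because the labels carry no signal, any fit to the training set here is memorization by construction, and accuracy on the held-out noise should hover around chance.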
Abstract
Very large deep learning models trained using gradient descent are remarkably resistant to memorization given their huge capacity, but are at the same time capable of fitting large datasets of pure noise. Here methods are introduced by which models may be trained to memorize datasets that normally are generalized. We find that memorization is difficult relative to generalization, but that adding noise makes memorization easier. Increasing the dataset size exaggerates the characteristics of that dataset: model access to more training samples makes overfitting easier for random data, but somewhat harder for natural images. The bias of deep learning towards generalization is explored theoretically, and we show that generalization results from a model's parameters being attracted to points of maximal stability with respect to that model's inputs during gradient descent.
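One way to make "stability with respect to a model's inputs" concrete is the norm of the gradient of the output with respect to the input: the smaller it is, the less the output moves under a small input perturbation. The sketch below assumes that reading, which may differ from the paper's formal definition. The tiny network, its sizes, and the function names `model`, `input_gradient`, and `input_sensitivity` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((16, 4))  # hidden-layer weights (illustrative sizes)
w2 = rng.standard_normal(16)       # output weights

def model(x):
    """Tiny one-hidden-layer tanh network with a scalar output."""
    return float(w2 @ np.tanh(W1 @ x))

def input_gradient(x):
    """Analytic gradient of the output with respect to the input x."""
    hidden = np.tanh(W1 @ x)
    return W1.T @ (w2 * (1.0 - hidden ** 2))

def input_sensitivity(x):
    """Norm of the input gradient: smaller means more stable at x."""
    return float(np.linalg.norm(input_gradient(x)))

x = rng.standard_normal(4)

# Cross-check the analytic gradient against central finite differences.
eps = 1e-6
numeric = np.array(
    [(model(x + eps * e) - model(x - eps * e)) / (2 * eps) for e in np.eye(4)]
)
print("sensitivity at x:", input_sensitivity(x))
print("matches finite differences:", np.allclose(numeric, input_gradient(x), atol=1e-5))
```

Tracking a quantity like `input_sensitivity` over the course of training, averaged across a dataset, would be one way to probe whether parameters drift toward more stable configurations.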
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
