High-dimensional dynamics of generalization error in neural networks
Madhu S. Advani, Andrew M. Saxe

TL;DR
This paper analyzes the high-dimensional generalization dynamics of neural networks trained with gradient descent, showing through theory and experiments how network size and the scale of the initial weights govern overtraining and generalization.
Contribution
It introduces a novel high-dimensional analysis of neural network generalization, identifying phenomena like frozen subspaces and input conditioning that explain overtraining behavior.
Findings
Large networks can reduce overtraining without regularization.
Overtraining peaks at intermediate network sizes, when the effective number of free parameters equals the number of training samples.
Small initial weights are crucial for good generalization in high-dimensional regimes.
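The second finding can be illustrated with a minimal sketch in a noisy linear teacher-student setting. This is an illustrative assumption, not the authors' code: the function name `converged_generalization_error`, the Gaussian data model, the noise level, and the use of the minimum-norm least-squares solution (the point that gradient descent from zero initial weights converges to in a linear model) are all choices made here for illustration.

```python
import numpy as np

def converged_generalization_error(n_params, n_samples, noise_std=0.5,
                                   n_trials=50, seed=0):
    """Median test error of the minimum-norm least-squares fit.

    Hypothetical setup: a linear teacher with Gaussian inputs and additive
    label noise. The minimum-norm solution stands in for gradient descent
    run to convergence from zero initial weights, with no early stopping.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_trials):
        # Teacher weights with expected squared norm 1.
        w_teacher = rng.standard_normal(n_params) / np.sqrt(n_params)
        X = rng.standard_normal((n_samples, n_params))
        y = X @ w_teacher + noise_std * rng.standard_normal(n_samples)
        # Minimum-norm least-squares solution via the pseudoinverse.
        w_student = np.linalg.pinv(X) @ y
        # For unit-variance Gaussian test inputs, the expected squared test
        # error is ||w_student - w_teacher||^2 plus the label noise variance.
        errors.append(np.sum((w_student - w_teacher) ** 2) + noise_std ** 2)
    # The median is used because the error is heavy-tailed when
    # n_samples == n_params.
    return float(np.median(errors))

if __name__ == "__main__":
    n_params = 50
    for n_samples in (25, 40, 50, 60, 100):
        err = converged_generalization_error(n_params, n_samples)
        print(f"samples={n_samples:4d}  params={n_params}  test error={err:.3f}")
```

Under these assumptions, the error of the fully trained model should be largest near the point where the number of samples equals the number of parameters, and smaller on either side of it.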
Abstract
We perform an average case analysis of the generalization dynamics of large neural networks trained using gradient descent. We study the practically-relevant "high-dimensional" regime where the number of free parameters in the network is on the order of or even larger than the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of data and signal to noise ratio of the learning problem. We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks. Overtraining is worst at intermediate network sizes, when the effective number of free parameters equals the number of samples, and thus can be reduced by making a network smaller or larger. Additionally, in…
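The training and generalization error dynamics described in the abstract can be sketched numerically for a linear model. The following is a rough illustration under assumed settings, not the paper's derivation: the function name `gd_error_curves`, the Gaussian teacher-student data, the learning rate, and the step count are all hypothetical choices.

```python
import numpy as np

def gd_error_curves(n_params=50, n_samples=50, noise_std=0.5,
                    lr=0.1, n_steps=2000, seed=0):
    """Track training and generalization error during gradient descent.

    Hypothetical setup: linear regression on data from a noisy linear
    teacher, trained by full-batch gradient descent from zero weights.
    """
    rng = np.random.default_rng(seed)
    w_teacher = rng.standard_normal(n_params) / np.sqrt(n_params)
    X = rng.standard_normal((n_samples, n_params))
    y = X @ w_teacher + noise_std * rng.standard_normal(n_samples)

    w = np.zeros(n_params)  # small (here zero) initial weights
    train_err, gen_err = [], []
    for _ in range(n_steps):
        residual = X @ w - y
        train_err.append(np.mean(residual ** 2))
        # Expected test error for unit-variance Gaussian test inputs.
        gen_err.append(np.sum((w - w_teacher) ** 2) + noise_std ** 2)
        # Gradient of the halved mean squared training error.
        w -= lr * (X.T @ residual) / n_samples
    return np.array(train_err), np.array(gen_err)

if __name__ == "__main__":
    train_err, gen_err = gd_error_curves()
    best_step = int(np.argmin(gen_err))
    print(f"best early-stopping step: {best_step}")
    print(f"generalization error at best step: {gen_err[best_step]:.3f}")
    print(f"generalization error at final step: {gen_err[-1]:.3f}")
```

In a setting like this one, with as many parameters as samples, the training error keeps decreasing while the generalization error typically reaches a minimum and then rises, which is the gap that early stopping is meant to exploit.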
Taxonomy
Methods: Early Stopping
