SGD on Neural Networks Learns Functions of Increasing Complexity
Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, Boaz Barak

TL;DR
This paper studies experimentally how SGD learns functions of increasing complexity as the training of deep neural networks progresses. This behavior may help explain why SGD-trained networks generalize well, and the paper shows that the linear classifier learned in the early epochs is retained throughout training.
Contribution
It provides experimental evidence that SGD initially learns simple linear classifiers and gradually moves to more complex functions, supported by a new measure based on conditional mutual information.
Findings
In the initial epochs, almost all of the performance improvement can be explained by a linear classifier.
SGD retains this early linear classifier even when training continues to zero training error.
A new measure based on conditional mutual information quantifies classifier similarity.
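The findings above can be illustrated with a minimal sketch of a conditional-mutual-information measure. It assumes the measure takes the form I(F;Y) − I(F;Y|L), where F is the network's predictions, L is a linear classifier's predictions, and Y is the true labels; this summary does not spell out the definition, so treat that form as an assumption. The function names (`entropy`, `mutual_info`, `cond_mutual_info`, `performance_explained`) and the toy labels are hypothetical, and the entropies are simple plug-in estimates from empirical counts.

```python
import math
from collections import Counter

def entropy(*cols):
    """Plug-in joint entropy (in bits) of one or more discrete label sequences."""
    joint = list(zip(*cols))
    n = len(joint)
    return -sum((c / n) * math.log2(c / n) for c in Counter(joint).values())

def mutual_info(a, b):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return entropy(a) + entropy(b) - entropy(a, b)

def cond_mutual_info(a, b, c):
    """I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C)."""
    return entropy(a, c) + entropy(b, c) - entropy(a, b, c) - entropy(c)

def performance_explained(f, y, l):
    """Assumed measure: the part of F's information about Y that L accounts for."""
    return mutual_info(f, y) - cond_mutual_info(f, y, l)

# Toy data (hypothetical): true labels and a linear classifier that is 75% accurate.
y = [0, 0, 0, 0, 1, 1, 1, 1]
l = [0, 0, 0, 1, 1, 1, 1, 0]

f_early = list(l)  # early in training: the network behaves like the linear classifier
f_late = list(y)   # late in training: the network fits the labels exactly

print(performance_explained(f_early, y, l))
print(performance_explained(f_late, y, l))
```

In this toy case both calls return I(L;Y): the early network's performance is fully accounted for by the linear classifier, and the late network, although more accurate, still carries the same linearly explained component. That is one way to read the retention finding, under the assumed form of the measure.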
Abstract
We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations progress, SGD learns functions of increasing complexity. This hypothesis can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime. We also show that the linear classifier learned in the initial stages is "retained" throughout the execution even if training is continued to the point of zero training error, and complement this with a theoretical result in a simplified model. Key to our work is a new measure of how well one classifier…
Taxonomy
Topics: Neural Networks and Applications · Machine Learning and Data Classification · Advanced Data Processing Techniques
Methods: Stochastic Gradient Descent
