Theory of Deep Learning III: explaining the non-overfitting puzzle
Tomaso Poggio, Kenji Kawaguchi, Qianli Liao, Brando Miranda, Lorenzo Rosasco, Xavier Boix, Jack Hidary, Hrushikesh Mhaskar

TL;DR
This paper explains why deep neural networks do not overfit despite their large capacity: near asymptotically stable minima of the empirical error, the gradient descent dynamics of a nonlinear network are topologically equivalent to those of a linear gradient system, which yields implicit regularization and good generalization.
Contribution
It extends properties of gradient descent from linear to nonlinear deep networks, providing a topological and dynamical systems perspective on overfitting and generalization.
Findings
Gradient descent enforces a form of implicit regularization controlled by the number of iterations.
Gradient descent converges to minimum-norm solutions for regression and to maximum-margin solutions for classification.
These properties support the robustness of deep networks to overparametrization.
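The first two findings can be illustrated in the linear setting that the paper takes as its starting point. The sketch below is not from the paper: it assumes an overparametrized least-squares problem with random data, zero initialization, and a step size of 1/L, and the names `gradient_descent`, `w_min_norm`, and `norms` are illustrative. Under those assumptions, the norm of the iterate grows with the iteration count (so stopping early acts as a regularizer) and the iterate approaches the minimum-norm interpolating solution given by the pseudoinverse.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                      # assumed sizes: fewer samples than parameters
X = rng.standard_normal((n, d))     # synthetic inputs (assumption, not paper data)
y = rng.standard_normal(n)          # synthetic targets


def gradient_descent(X, y, steps, lr):
    """Plain gradient descent on (1 / 2n) * ||X w - y||^2, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w


# Step size 1/L, where L = sigma_max(X)^2 / n is the largest Hessian eigenvalue.
lr = n / np.linalg.norm(X, 2) ** 2

# Minimum-norm solution of X w = y via the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# The iterate norm after increasing numbers of steps: more steps, less regularization.
step_counts = [1, 10, 100, 2000]
norms = [np.linalg.norm(gradient_descent(X, y, t, lr)) for t in step_counts]

w_gd = gradient_descent(X, y, 2000, lr)
for t, nrm in zip(step_counts, norms):
    print(f"steps={t:5d}  ||w||={nrm:.4f}")
print("||w_min_norm|| =", np.linalg.norm(w_min_norm))
print("distance to min-norm solution:", np.linalg.norm(w_gd - w_min_norm))
```

Starting from zero keeps every iterate in the row space of `X`, which is why the limit is the minimum-norm solution rather than an arbitrary interpolant. The sketch covers only the linear regression case; the maximum-margin behavior for classification and the extension to nonlinear deep networks are not demonstrated here.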
Abstract
A main puzzle of deep networks revolves around the absence of overfitting despite large overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated with gradient descent minimization of nonlinear networks are topologically equivalent, near the asymptotically stable minima of the empirical error, to those of a linear gradient system in a quadratic potential with a degenerate (for square loss) or almost degenerate (for logistic or cross-entropy loss) Hessian. The proposition depends on the qualitative theory of dynamical systems and is supported by numerical results. Our main propositions extend to deep nonlinear networks two properties of gradient descent for linear networks that have recently been established (1) to be key to their generalization properties: 1. Gradient descent enforces a form of…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Gaussian Processes and Bayesian Inference
Methods: Early Stopping
