Theory of Deep Learning III: explaining the non-overfitting puzzle

Tomaso Poggio; Kenji Kawaguchi; Qianli Liao; Brando Miranda; Lorenzo Rosasco; Xavier Boix; Jack Hidary; Hrushikesh Mhaskar

arXiv:1801.00173·cs.LG·January 17, 2018·49 cites

Open Access

TL;DR

This paper explains why deep neural networks do not overfit despite their large capacity, by showing that their training dynamics are similar to linear systems near stable minima, leading to implicit regularization and good generalization.

Contribution

It extends properties of gradient descent from linear to nonlinear deep networks, providing a topological and dynamical systems perspective on overfitting and generalization.

Findings

01

Gradient descent enforces implicit regularization controlled by iteration count.

02

Gradient descent converges to minimum-norm solutions for regression and to maximum-margin solutions for classification.

03

These properties support the robustness of deep networks to overparametrization.
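The first two findings can be illustrated in the linear regression case, which the paper takes as its starting point. The sketch below is not from the paper: the function name, data sizes, and step size are assumptions chosen for illustration. It runs plain gradient descent from a zero initialization on an overparametrized least-squares problem and compares the result with the minimum-norm interpolating solution given by the pseudoinverse.

```python
import numpy as np

def gradient_descent_least_squares(X, y, lr, n_iters):
    """Plain gradient descent on 0.5 * ||X w - y||^2, starting from w = 0.

    Hypothetical helper for illustration; returns the final weights and
    the weight norm recorded after every iteration.
    """
    w = np.zeros(X.shape[1])
    norms = []
    for _ in range(n_iters):
        w = w - lr * X.T @ (X @ w - y)  # gradient of the square loss
        norms.append(np.linalg.norm(w))
    return w, norms

# Assumed toy problem: 5 samples, 20 parameters (overparametrized).
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))
y = rng.standard_normal(5)

# Step size 1 / sigma_max(X)^2 keeps the iteration stable.
lr = 1.0 / np.linalg.norm(X, 2) ** 2
w_gd, norms = gradient_descent_least_squares(X, y, lr, 5000)

# Minimum-norm solution among all w with X w = y.
w_min_norm = np.linalg.pinv(X) @ y
```

With this initialization and step size, the weight norm grows with the iteration count, so stopping early yields a smaller-norm solution. In that sense the number of iterations acts as a regularization parameter, and running to convergence recovers the minimum-norm interpolant.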

Abstract

A main puzzle of deep networks revolves around the absence of overfitting despite large overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated with gradient descent minimization of nonlinear networks are topologically equivalent, near the asymptotically stable minima of the empirical error, to a linear gradient system in a quadratic potential with a degenerate (for square loss) or almost degenerate (for logistic or cross-entropy loss) Hessian. The proposition depends on the qualitative theory of dynamical systems and is supported by numerical results. Our main propositions extend to deep nonlinear networks two properties of gradient descent for linear networks that have recently been established (1) to be key to their generalization properties: 1. Gradient descent enforces a form of…
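As a rough sketch of what a "linear gradient system in a quadratic potential" looks like, one can write the gradient flow near a minimum as follows. The notation is assumed for illustration and does not come from the abstract: w denotes the weights, w^* a minimum of the empirical error L, and H the Hessian of L at w^*.

```latex
\[
\dot{w} = -\nabla_w L(w) \;\approx\; -H\,(w - w^*),
\qquad
V(w) = \tfrac{1}{2}\,(w - w^*)^\top H\,(w - w^*),
\qquad
H = \nabla_w^2 L(w^*) \succeq 0 .
\]
```

In this picture, a degenerate Hessian is one with zero eigenvalues, so the potential V is flat along some directions in weight space. An almost degenerate Hessian has very small but nonzero eigenvalues along those directions.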


Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Gaussian Processes and Bayesian Inference

Methods: Early Stopping