Towards moderate overparameterization: global convergence guarantees for training shallow neural networks

Samet Oymak; Mahdi Soltanolkotabi

arXiv:1902.04674·cs.LG·February 14, 2019·87 cites

TL;DR

This paper proves that for shallow neural networks with smooth activations, gradient descent converges to a global optimum when the square root of the number of parameters exceeds the training data size, narrowing the gap between theory and practice.

Contribution

It establishes convergence guarantees for moderately overparameterized shallow neural networks, including networks with ReLU activations, at overparameterization levels closer to those used in practice.

Findings

01

Gradient descent converges geometrically to a global optimum under moderate overparameterization.

02

Convergence results hold for both smooth and non-differentiable activations like ReLUs.

03

It suffices for the square root of the number of parameters to exceed the training data size.
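The findings above can be illustrated with a small numerical sketch. This is not the paper's experimental setup. It assumes a one-hidden-layer network with a tanh activation, fixed output weights `v`, trainable hidden weights `W`, a quadratic loss, and random labels. The sizes `n`, `d`, `k`, the step size `eta`, and the function name `train_shallow_net` are all hypothetical choices made for illustration. They are picked so that the square root of the parameter count (`k * d`) exceeds the number of training points `n`.

```python
import numpy as np


def train_shallow_net(n=20, d=10, k=500, steps=3000, seed=0):
    """Gradient descent on a one-hidden-layer tanh network with quadratic loss.

    All sizes, the tanh activation, and the step size are illustrative
    assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)

    # Unit-norm inputs and arbitrary (random) labels.
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)

    # Hidden weights W are trained; output weights v stay fixed.
    W = rng.standard_normal((k, d))
    v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)

    # Conservative step size: 1 / (squared spectral norm of X).
    eta = 1.0 / np.linalg.norm(X, 2) ** 2

    losses = np.empty(steps + 1)
    for t in range(steps + 1):
        A = np.tanh(X @ W.T)               # (n, k) hidden activations
        r = A @ v - y                      # (n,) residuals
        losses[t] = 0.5 * np.sum(r ** 2)   # quadratic loss
        if t == steps:
            break
        # Chain rule for the gradient w.r.t. W; tanh'(z) = 1 - tanh(z)^2.
        G = ((r[:, None] * (1.0 - A ** 2)) * v[None, :]).T @ X  # (k, d)
        W -= eta * G
    return losses


if __name__ == "__main__":
    n, d, k = 20, 10, 500
    losses = train_shallow_net(n=n, d=d, k=k)
    print("sqrt(k * d) =", np.sqrt(k * d), "vs n =", n)
    for t in (0, 100, 1000, 3000):
        print("step", t, "loss", losses[t])
```

Plotting `losses` on a logarithmic axis is a quick way to check whether the decay looks roughly linear, which is what a geometric rate would produce. Shrinking `k` until `sqrt(k * d)` falls below `n` gives an informal view of how the behavior changes outside this regime.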

Abstract

Many modern neural network architectures are trained in an overparameterized regime where the number of model parameters exceeds the size of the training dataset. Sufficiently overparameterized neural network architectures in principle have the capacity to fit any set of labels, including random noise. However, given the highly nonconvex nature of the training landscape, it is not clear what level and kind of overparameterization is required for first-order methods to converge to a global optimum that perfectly interpolates any labels. A number of recent theoretical works have shown that for very wide neural networks, where the number of hidden units is polynomially large in the size of the training data, gradient descent starting from a random initialization does indeed converge to a global optimum. However, in practice much more moderate levels of overparameterization seem to be sufficient and…

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and Algorithms