Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S. Du; Xiyu Zhai; Barnabas Poczos; Aarti Singh

arXiv:1810.02054 · cs.LG · February 6, 2019 · 418 citations

TL;DR

This paper proves that randomly initialized gradient descent drives the training loss of an over-parameterized two-layer ReLU network to a global minimum at a linear rate, which helps explain why such training succeeds in practice despite the non-convex objective.

Contribution

The paper gives a rigorous analysis showing that gradient descent with random initialization converges to a global minimum of the training loss for over-parameterized two-layer ReLU networks.

Findings

01

Randomly initialized gradient descent converges to a globally optimal solution when the hidden layer is wide enough and no two inputs are parallel.

02

Convergence occurs at a linear rate for the quadratic loss.

03

Over-parameterization and random initialization together keep every weight vector close to its initialization at all iterations, which makes a strong-convexity-like property available to the analysis.
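
To make "linear rate" in finding 02 concrete, here is a generic statement of what such a rate means. The symbols $u(k)$ (the vector of network predictions on the $n$ training points after $k$ steps), $y$ (the label vector), $\rho$ and $\epsilon$ are notation assumed here for illustration; the summary above does not give the exact constant.

$$
\|y - u(k)\|_2^2 \;\le\; (1 - \rho)^k \, \|y - u(0)\|_2^2 , \qquad 0 < \rho < 1 .
$$

Because $(1 - \rho)^k \le e^{-\rho k}$, a bound of this form implies that

$$
k \;\ge\; \frac{1}{\rho} \log \frac{\|y - u(0)\|_2^2}{\epsilon}
$$

steps are enough to bring the squared error below $\epsilon$, so the number of iterations grows only logarithmically in $1/\epsilon$.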

Abstract

One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For a shallow neural network with $m$ hidden nodes, ReLU activation, and $n$ training data points, we show that as long as $m$ is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function. Our analysis relies on the following observation: over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows us to exploit a strong convexity-like property to show that…
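
The setting in the abstract can be tried out numerically. The following is a minimal, dependency-free sketch, not the authors' code. It assumes a common parameterization, $f(x) = \frac{1}{\sqrt{m}} \sum_r a_r \,\mathrm{ReLU}(w_r^\top x)$, with Gaussian hidden weights $w_r$, fixed random-sign output weights $a_r$, unit-norm inputs, and full-batch gradient descent on the hidden layer only. The function names, problem sizes, step size, and seed are all hypothetical choices made for illustration.

```python
import math
import random


def forward(w, a, xs, scale):
    """Return pre-activations w_r . x_i and the network output for every input."""
    pre = [[sum(wk * xk for wk, xk in zip(wr, x)) for wr in w] for x in xs]
    preds = [scale * sum(ar * max(p, 0.0) for ar, p in zip(a, row)) for row in pre]
    return pre, preds


def train_two_layer_relu(n=5, d=3, m=200, steps=300, lr=0.1, seed=0):
    """Gradient descent on the hidden layer of a width-m two-layer ReLU network.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)

    # Unit-norm random inputs (almost surely no two are parallel), bounded labels.
    xs = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        xs.append([c / norm for c in v])
    ys = [rng.uniform(-1.0, 1.0) for _ in range(n)]

    # Random initialization; output signs a_r stay fixed during training.
    w = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    w0 = [row[:] for row in w]
    a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
    scale = 1.0 / math.sqrt(m)

    losses = []
    for _ in range(steps):
        pre, preds = forward(w, a, xs, scale)
        resid = [p - y for p, y in zip(preds, ys)]
        losses.append(0.5 * sum(e * e for e in resid))
        # Gradient of the quadratic loss w.r.t. w_r:
        # (1/sqrt(m)) * sum_i resid_i * a_r * x_i * 1[w_r . x_i >= 0]
        for r in range(m):
            for i in range(n):
                if pre[i][r] >= 0.0:
                    coef = lr * scale * a[r] * resid[i]
                    for k in range(d):
                        w[r][k] -= coef * xs[i][k]

    # Loss after the final update.
    _, preds = forward(w, a, xs, scale)
    losses.append(0.5 * sum((p - y) ** 2 for p, y in zip(preds, ys)))

    # Largest distance any hidden weight vector travelled from its initialization.
    max_move = max(
        math.sqrt(sum((u - v) ** 2 for u, v in zip(wr, w0r)))
        for wr, w0r in zip(w, w0)
    )
    return losses, max_move


losses, max_move = train_two_layer_relu()
print(f"initial loss {losses[0]:.4f}, final loss {losses[-1]:.6f}")
print(f"max ||w_r - w_r(0)|| = {max_move:.4f}")
```

Plotting `losses` on a log scale, or rerunning with a larger `m`, is one way to explore the two claims informally: a roughly geometric decay of the loss, and hidden weights that move only a little from where they started. The sketch illustrates the setting and proves nothing about it.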

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Advanced Neural Network Applications