A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu; Yuanzhi Li; Zhao Song

arXiv:1811.03962 · cs.LG · June 18, 2019 · 628 cites

TL;DR

This paper provides a theoretical explanation for why over-parameterized deep neural networks trained with SGD can efficiently find global minima, demonstrating polynomial-time convergence and near-convex landscape properties.

Contribution

It proves that over-parameterized deep networks have nearly-convex landscapes near initialization, enabling polynomial-time convergence of SGD to global minima for various architectures.

Findings

01 SGD finds global minima of the training objective in polynomial time under over-parameterization.

02 The optimization landscape is almost-convex and semi-smooth near initialization.

03 The theory applies to fully-connected networks, CNNs, and ResNets with ReLU activations.

Abstract

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: polynomial in $L$, the number of layers, and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and…
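The two assumptions in the abstract (non-degenerate inputs, large width) can be illustrated with a small numerical experiment. The sketch below is not the paper's construction: it assumes a single hidden ReLU layer instead of $L$ layers, full-batch gradient descent instead of SGD, a fixed random output layer, and a squared loss on random labels. The function name `train_wide_relu_net` and all hyperparameters are hypothetical choices made for illustration.

```python
import numpy as np

def train_wide_relu_net(n=20, d=10, width=4000, lr=0.2, steps=1000, seed=0):
    """Full-batch gradient descent on a wide one-hidden-layer ReLU network.

    Illustrative assumptions (not taken from the paper): only the hidden
    layer W is trained, the output weights `a` are fixed random signs, and
    the loss is 0.5 * sum of squared residuals on random labels.
    """
    rng = np.random.default_rng(seed)

    # Non-degenerate inputs: distinct random points scaled to unit norm.
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)

    # Random initialization; `width` is much larger than n.
    W0 = rng.standard_normal((width, d))
    a = rng.choice([-1.0, 1.0], size=width)
    W = W0.copy()

    losses = []
    for _ in range(steps):
        pre = X @ W.T                                    # (n, width) pre-activations
        out = np.maximum(pre, 0.0) @ a / np.sqrt(width)  # (n,) network outputs
        resid = out - y
        losses.append(0.5 * float(resid @ resid))
        # Gradient of the loss with respect to W, using the ReLU mask (pre > 0).
        grad = ((pre > 0) * resid[:, None] * a[None, :]).T @ X / np.sqrt(width)
        W -= lr * grad

    # How far the weights moved relative to the size of the initialization.
    rel_move = float(np.linalg.norm(W - W0) / np.linalg.norm(W0))
    return losses, rel_move

losses, rel_move = train_wide_relu_net()
print(f"initial loss: {losses[0]:.3e}")
print(f"final loss:   {losses[-1]:.3e}")
print(f"relative weight movement: {rel_move:.3f}")
```

In this toy setting the training loss should fall by many orders of magnitude while the weights stay close to their random initialization, which mirrors the picture the abstract describes. Reducing `width` toward `n` is a simple way to explore how the behavior changes when the network is no longer over-parameterized.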

Taxonomy

Topics: Reinforcement Learning in Robotics · Model Reduction and Neural Networks · Evolutionary Algorithms and Applications

Methods: 1x1 Convolution · Convolution · Local Response Normalization · Grouped Convolution · Dropout · Dense Connections · Max Pooling · Softmax