A Convergence Theory for Deep Learning via Over-Parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song

TL;DR
This paper provides a theoretical explanation for why over-parameterized deep neural networks trained with SGD can efficiently find global minima, demonstrating polynomial-time convergence and near-convex landscape properties.
Contribution
It proves that over-parameterized deep networks have nearly-convex landscapes near initialization, enabling polynomial-time convergence of SGD to global minima for various architectures.
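The claim can be illustrated numerically with a minimal sketch. It is not the paper's experiment or proof: the initialization scales, the width of 256, the orthonormal toy inputs, the choice to train only the hidden layers, and the use of full-batch gradient descent instead of SGD are all assumptions made for illustration, and the names `forward`, `loss_and_grads`, and `train` are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def forward(X, A, Ws, B):
    """Return outputs plus cached activations and pre-activations."""
    hs = [relu(X @ A.T)]            # h_0, shape (n, m)
    zs = []
    for W in Ws:
        z = hs[-1] @ W.T            # pre-activation of this hidden layer
        zs.append(z)
        hs.append(relu(z))
    return hs[-1] @ B, hs, zs       # outputs have shape (n,)


def loss_and_grads(X, y, A, Ws, B):
    """Squared loss and its gradients w.r.t. the hidden matrices only."""
    out, hs, zs = forward(X, A, Ws, B)
    r = out - y
    loss = 0.5 * np.sum(r ** 2)
    g = np.outer(r, B)              # d loss / d h_L, shape (n, m)
    grads = [None] * len(Ws)
    for l in reversed(range(len(Ws))):
        dz = g * (zs[l] > 0)        # ReLU passes gradient where z > 0
        grads[l] = dz.T @ hs[l]     # hs[l] is the input to layer l
        g = dz @ Ws[l]              # propagate to the previous layer
    return loss, grads


def train(n=6, d=6, m=256, depth=2, lr=1e-3, steps=1000, seed=0):
    """Full-batch gradient descent on a wide ReLU network (toy setting)."""
    rng = np.random.default_rng(seed)
    X = np.eye(d)[:n]               # assumed: orthonormal, unit-norm inputs
    y = rng.standard_normal(n)      # arbitrary real-valued targets
    # Assumed Gaussian initialization scales; A and B stay fixed.
    A = rng.normal(0.0, np.sqrt(2.0 / m), (m, d))
    Ws = [rng.normal(0.0, np.sqrt(2.0 / m), (m, m)) for _ in range(depth)]
    B = rng.normal(0.0, np.sqrt(1.0 / d), m)
    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(X, y, A, Ws, B)
        losses.append(loss)
        for W, g in zip(Ws, grads):
            W -= lr * g             # in-place gradient step
    losses.append(loss_and_grads(X, y, A, Ws, B)[0])
    return losses


if __name__ == "__main__":
    history = train()
    print(f"initial loss {history[0]:.4f}, final loss {history[-1]:.3e}")
```

With far more hidden units than samples, the training loss in this toy run should fall well below its starting value. The width here is only illustrative; the paper's guarantee requires a width that is polynomial in the depth and the sample count.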
Findings
SGD finds global minima in polynomial time under over-parameterization.
The optimization landscape is almost-convex and semi-smooth near initialization.
The theory covers the non-smooth ReLU activation and applies to fully-connected networks, CNNs, and ResNets.
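A simplified derivation suggests why the landscape findings above lead to fast convergence. It is a sketch under stated assumptions, not the paper's argument: it assumes a gradient lower bound with constant $\mu$ near initialization, and it replaces semi-smoothness with ordinary $\beta$-smoothness. The symbols $\mu$, $\beta$, and $\eta$ are placeholders that do not appear in the source.

```latex
% Assumptions (hypothetical constants \mu, \beta > 0), valid near initialization:
%   (i)  \|\nabla F(W)\|_F^2 \ge \mu \, F(W)
%   (ii) F(W + W') \le F(W) + \langle \nabla F(W), W' \rangle
%                     + \tfrac{\beta}{2} \|W'\|_F^2
% Gradient step: W_{t+1} = W_t - \eta \nabla F(W_t), with \eta = 1/\beta.
\begin{align*}
F(W_{t+1})
  &\le F(W_t) - \eta \,\|\nabla F(W_t)\|_F^2
       + \frac{\beta \eta^2}{2} \,\|\nabla F(W_t)\|_F^2 \\
  &=   F(W_t) - \frac{1}{2\beta} \,\|\nabla F(W_t)\|_F^2 \\
  &\le \Bigl(1 - \frac{\mu}{2\beta}\Bigr) F(W_t).
\end{align*}
% Iterating gives F(W_T) \le (1 - \mu / (2\beta))^T F(W_0),
% i.e. the loss shrinks geometrically as long as the iterates stay
% in the region where (i) and (ii) hold.
```

Under these assumptions the number of steps needed to reach loss $\varepsilon$ scales like $(\beta/\mu)\log(F(W_0)/\varepsilon)$, which is polynomial whenever $\beta/\mu$ is.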
Abstract
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: polynomial in L, the number of layers, and in n, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and…
Taxonomy
Topics: Reinforcement Learning in Robotics · Model Reduction and Neural Networks · Evolutionary Algorithms and Applications
Methods: 1x1 Convolution · Convolution · Local Response Normalization · Grouped Convolution · Dropout · Dense Connections · Max Pooling · Softmax
