Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu

TL;DR
This paper proves that, with proper random initialization, gradient descent and stochastic gradient descent can reach global minima of the training loss for over-parameterized deep ReLU networks, which helps explain why training deep networks often succeeds in practice.
Contribution
It provides a theoretical analysis showing that proper initialization and over-parameterization enable gradient methods to find global minima in deep ReLU networks.
Findings
Gradient descent and stochastic gradient descent both find global minima of the training loss in over-parameterized deep ReLU networks.
With Gaussian random initialization, the iterates stay inside a small perturbation region centered on the initial weights.
Inside that region, the empirical loss has local curvature properties that ensure global convergence.
Abstract
We study the problem of training deep neural networks with the Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centered around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our…
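The key idea in the abstract (iterates that stay close to a Gaussian initialization while the training loss goes down) can be observed numerically on a toy problem. The following is a minimal sketch, not the paper's construction: it assumes a single hidden layer instead of a deep network, a fixed output layer with random signs, a 1/sqrt(m) output scaling, the logistic loss, and random unit-norm inputs with random labels. The function names (relu_net, gd_step, run_demo) and all hyperparameters are hypothetical choices made for illustration.

```python
import math
import random


def relu_net(W, a, x):
    """f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x), with a fixed output layer a."""
    total = 0.0
    for w_j, a_j in zip(W, a):
        pre = sum(wi * xi for wi, xi in zip(w_j, x))
        if pre > 0.0:
            total += a_j * pre
    return total / math.sqrt(len(W))


def logistic_loss(z):
    """Numerically stable log(1 + exp(-z))."""
    return math.log1p(math.exp(-abs(z))) + max(-z, 0.0)


def empirical_loss(W, a, X, y):
    """Average logistic loss over the training set."""
    return sum(logistic_loss(yi * relu_net(W, a, xi)) for xi, yi in zip(X, y)) / len(X)


def gd_step(W, a, X, y, lr):
    """One full-batch gradient descent step on the hidden-layer weights W."""
    m, n, d = len(W), len(X), len(X[0])
    grad = [[0.0] * d for _ in range(m)]
    for xi, yi in zip(X, y):
        z = yi * relu_net(W, a, xi)
        # Derivative of log(1 + exp(-z)) with respect to z, computed stably.
        if z >= 0.0:
            e = math.exp(-z)
            dl = -e / (1.0 + e)
        else:
            dl = -1.0 / (1.0 + math.exp(z))
        coef = dl * yi / (math.sqrt(m) * n)
        for j in range(m):
            # ReLU passes gradient only where the pre-activation is positive.
            if sum(wi * xk for wi, xk in zip(W[j], xi)) > 0.0:
                for k in range(d):
                    grad[j][k] += coef * a[j] * xi[k]
    return [[w - lr * g for w, g in zip(w_j, g_j)] for w_j, g_j in zip(W, grad)]


def frobenius(A, B=None):
    """Frobenius norm of A, or of A - B when B is given."""
    if B is None:
        return math.sqrt(sum(v * v for row in A for v in row))
    return math.sqrt(sum((u - v) ** 2 for ra, rb in zip(A, B) for u, v in zip(ra, rb)))


def run_demo(m=200, d=5, n=8, steps=200, lr=1.0, seed=0):
    """Train a wide one-hidden-layer ReLU net; report losses and relative movement."""
    rng = random.Random(seed)
    X = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(t * t for t in v))
        X.append([t / norm for t in v])          # unit-norm inputs (assumption)
    y = [rng.choice([-1.0, 1.0]) for _ in range(n)]  # random binary labels
    W0 = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]  # Gaussian init
    a = [rng.choice([-1.0, 1.0]) for _ in range(m)]  # fixed output signs (assumption)

    W = [row[:] for row in W0]
    initial_loss = empirical_loss(W, a, X, y)
    for _ in range(steps):
        W = gd_step(W, a, X, y, lr)
    final_loss = empirical_loss(W, a, X, y)
    rel_dist = frobenius(W, W0) / frobenius(W0)
    return initial_loss, final_loss, rel_dist


if __name__ == "__main__":
    init_l, final_l, rel = run_demo()
    print(f"initial loss: {init_l:.4f}")
    print(f"final loss:   {final_l:.4f}")
    print(f"relative distance ||W - W0||_F / ||W0||_F: {rel:.4f}")
```

Rerunning run_demo with a larger hidden width m (for example m=800) is one way to probe how the relative distance from initialization changes with over-parameterization in this toy setting. This only illustrates the phenomenon; it does not reproduce the paper's conditions or rates.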
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Machine Learning and Algorithms
Methods: Affine Coupling · Normalizing Flows
