Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Difan Zou; Yuan Cao; Dongruo Zhou; Quanquan Gu

arXiv:1811.08888·cs.LG·December 31, 2018·217 cites

Open Access

TL;DR

This paper proves that, for binary classification with over-parameterized deep ReLU networks, gradient descent and stochastic gradient descent with proper random initialization find global minima of the training loss, which helps explain why training deep networks often succeeds.

Contribution

It provides a theoretical analysis showing that proper initialization and over-parameterization enable gradient methods to find global minima in deep ReLU networks.

Findings

01. Gradient descent and stochastic gradient descent find global minima of the training loss in over-parameterized deep ReLU networks.

02. Gaussian random initialization keeps the iterates inside a small perturbation region around the initial weights.

03. Inside that region, the empirical loss has local curvature properties that ensure global convergence.

Abstract

We study the problem of training deep neural networks with Rectified Linear Unit (ReLU) activation function using gradient descent and stochastic gradient descent. In particular, we study the binary classification problem and show that for a broad family of loss functions, with proper random weight initialization, both gradient descent and stochastic gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under mild assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by (stochastic) gradient descent produces a sequence of iterates that stay inside a small perturbation region centering around the initial weights, in which the empirical loss function of deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of (stochastic) gradient descent. Our…
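The key idea in the abstract (iterates that stay near a Gaussian initialization while the training loss falls) can be illustrated with a toy experiment. The sketch below is not the paper's construction or its proof setting: the data (random unit-norm inputs with random labels), the widths, the step size, the logistic loss, the fixed output layer scaled by 1/sqrt(m), and the use of full-batch gradient descent rather than SGD are all assumptions chosen for illustration, and the helper names (`init_params`, `loss_and_grads`, `train`) are hypothetical. It trains a two-hidden-layer ReLU network at two widths and reports the training loss together with the relative distance of the hidden weights from their initial values.

```python
import numpy as np


def init_params(rng, d, m, num_hidden):
    """Gaussian init: hidden weight entries ~ N(0, 2/m); fixed output layer."""
    in_dims = [d] + [m] * (num_hidden - 1)
    weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(k, m)) for k in in_dims]
    # Output weights are random signs scaled by 1/sqrt(m) and never trained
    # (an assumption made for this sketch).
    v = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)
    return weights, v


def loss_and_grads(weights, v, X, y):
    """Mean logistic loss log(1 + exp(-y f(x))) and gradients w.r.t. hidden weights."""
    acts = [X]
    for W in weights:
        acts.append(np.maximum(acts[-1] @ W, 0.0))  # ReLU layer
    f = acts[-1] @ v
    margins = y * f
    loss = np.mean(np.logaddexp(0.0, -margins))
    dloss_df = -y / (1.0 + np.exp(margins)) / len(y)  # d(mean loss)/d f per example
    delta = np.outer(dloss_df, v)  # gradient w.r.t. the last hidden activation
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        delta = delta * (acts[l + 1] > 0)  # ReLU derivative
        grads[l] = acts[l].T @ delta       # gradient w.r.t. weights[l]
        delta = delta @ weights[l].T       # gradient w.r.t. acts[l]
    return loss, grads


def train(m, steps=200, lr=0.5, n=20, d=10, num_hidden=2, seed=0):
    """Full-batch gradient descent; returns (initial loss, final loss, relative movement)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = rng.choice([-1.0, 1.0], size=n)            # random binary labels
    weights, v = init_params(rng, d, m, num_hidden)
    init = [W.copy() for W in weights]
    initial_loss, _ = loss_and_grads(weights, v, X, y)
    for _ in range(steps):
        _, grads = loss_and_grads(weights, v, X, y)
        weights = [W - lr * g for W, g in zip(weights, grads)]
    final_loss, _ = loss_and_grads(weights, v, X, y)
    # Frobenius distance from initialization, aggregated over all hidden layers.
    moved = np.sqrt(sum(np.sum((W - W0) ** 2) for W, W0 in zip(weights, init)))
    scale = np.sqrt(sum(np.sum(W0 ** 2) for W0 in init))
    return initial_loss, final_loss, moved / scale


results = {width: train(width) for width in (32, 512)}
for width, (l0, l1, rel) in results.items():
    print(f"width={width:4d}  loss {l0:.3f} -> {l1:.3f}  relative movement {rel:.3f}")
```

In this toy setting the wider network is expected to fit the data while moving a smaller fraction of its initial weight norm, which is the qualitative behavior the abstract describes. The quantitative width requirements and the analysis of SGD are in the paper itself and are not reproduced here.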

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Machine Learning and Algorithms

Methods: Affine Coupling · Normalizing Flows