SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

Alon Brutzkus; Amir Globerson; Eran Malach; Shai Shalev-Shwartz

arXiv:1710.10174·cs.LG·October 30, 2017·37 cites

TL;DR

This paper proves that stochastic gradient descent (SGD) trains over-parameterized two-layer neural networks with Leaky ReLU activations to a global minimum on linearly separable data, and that the resulting network generalizes with guarantees that do not depend on the network size.

Contribution

The paper provides both optimization and generalization guarantees for SGD on over-parameterized two-layer Leaky ReLU networks trained on linearly separable data.

Findings

01 SGD converges to a global minimum in over-parameterized networks.

02 Generalization bounds are independent of network size.

03 SGD avoids overfitting despite high model capacity.

Abstract

Neural networks exhibit good generalization behavior in the over-parameterized regime, where the number of network parameters exceeds the number of observations. Nonetheless, current generalization bounds for neural networks fail to explain this phenomenon. In an attempt to bridge this gap, we study the problem of learning a two-layer over-parameterized neural network, when the data is generated by a linearly separable function. In the case where the network has Leaky ReLU activations, we provide both optimization and generalization guarantees for over-parameterized networks. Specifically, we prove convergence rates of SGD to a global minimum and provide generalization guarantees for this global minimum that are independent of the network size. Therefore, our result clearly shows that the use of SGD for optimization both finds a global minimum, and avoids overfitting despite the high…
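
The setting in the abstract can be illustrated with a small experiment. The following is a minimal sketch, not the authors' code: it assumes a hinge loss, a second layer fixed to half positive and half negative weights, unit-norm inputs, and SGD with batch size 1 on the first layer only. None of these details appear in the abstract above, and the names `make_separable_data`, `train_sgd`, `error_rate`, and the constant `ALPHA` are hypothetical.

```python
import math
import random

ALPHA = 0.2  # Leaky ReLU slope for negative inputs (assumed value)


def leaky_relu(z):
    return z if z > 0 else ALPHA * z


def leaky_relu_slope(z):
    return 1.0 if z > 0 else ALPHA


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def make_separable_data(n, d, margin, rng):
    """Unit-norm points labeled by the sign of their first coordinate."""
    data = []
    while len(data) < n:
        x = [rng.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(dot(x, x))
        x = [xi / norm for xi in x]
        if abs(x[0]) >= margin:  # keep only points away from the separator
            data.append((x, 1 if x[0] > 0 else -1))
    return data


def network(W, v, x):
    """Two-layer network: sum_j v[j] * leaky_relu(<W[j], x>)."""
    return sum(vj * leaky_relu(dot(wj, x)) for wj, vj in zip(W, v))


def train_sgd(data, k=10, lr=1.0, max_epochs=2000, seed=0):
    """SGD (batch size 1) on the hinge loss; only the first layer W is trained."""
    rng = random.Random(seed)
    d = len(data[0][0])
    # Assumed fixed second layer: k units with weight +c, k units with weight -c.
    c = 1.0 / math.sqrt(2 * k)
    v = [c] * k + [-c] * k
    W = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(2 * k)]
    updates = 0
    samples = list(data)
    for _ in range(max_epochs):
        rng.shuffle(samples)
        changed = False
        for x, y in samples:
            if y * network(W, v, x) < 1:  # hinge loss is active on this point
                for j in range(2 * k):
                    step = lr * y * v[j] * leaky_relu_slope(dot(W[j], x))
                    W[j] = [w + step * xi for w, xi in zip(W[j], x)]
                updates += 1
                changed = True
        if not changed:  # a full pass with zero hinge loss on every point
            break
    return W, v, updates


def error_rate(W, v, data):
    """Fraction of points whose label disagrees with the sign of the output."""
    wrong = sum(1 for x, y in data if y * network(W, v, x) <= 0)
    return wrong / len(data)
```

With `k=10` and inputs of dimension 5, the first layer has 100 trainable weights, so a training set of 40 points from `make_separable_data(40, 5, 0.2, random.Random(1))` puts the sketch in the over-parameterized regime. Calling `error_rate` on a fresh sample drawn the same way gives a rough empirical check of generalization.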

Taxonomy

Topics: Neural Networks and Applications · Stochastic Gradient Optimization Techniques · Model Reduction and Neural Networks

Methods: Stochastic Gradient Descent