Theory of Deep Learning IIb: Optimization Properties of SGD

Chiyuan Zhang; Qianli Liao; Alexander Rakhlin; Brando Miranda; Noah Golowich; Tomaso Poggio

arXiv:1801.02254 · cs.LG · January 9, 2018 · 45 citations

Open Access

TL;DR

This paper studies why stochastic gradient descent (SGD) tends to find flat, large-volume minima when training deep convolutional networks. It combines theoretical analysis and experiments to support the conjecture that SGD concentrates in probability on flat minimizers, which are with very high probability also global minimizers.

Contribution

The paper provides new theoretical and experimental evidence that SGD preferentially converges to flat minima, which are with high probability global minimizers of the training loss.

Findings

01 Like the classical Langevin equation, SGD concentrates in probability on flat, large-volume minima

02 Flat minima are, with high probability, also global minimizers

03 Experimental results support the conjecture about SGD's behavior

Abstract

In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability -- like the classical Langevin equation -- on large volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers.
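
The comparison with the Langevin equation can be made concrete with a toy simulation. The sketch below is not taken from the paper and does not run SGD on a network. It integrates the overdamped Langevin equation dw = -V'(w) dt + sqrt(2T) dB with an Euler-Maruyama scheme on a hypothetical one-dimensional loss V that has two minima of equal depth (both global, loss 0): a sharp one at w = -1 and a flat one at w = +1. The curvatures, temperature, step size, and all function names are assumptions chosen for illustration. Since the stationary density is proportional to exp(-V/T), each basin's probability mass scales roughly with its width, so with these curvatures the walker should spend approximately sqrt(25) / (sqrt(25) + sqrt(1)) = 5/6 of its time in the flat basin, up to discretization and sampling error.

```python
import math
import random

# Hypothetical toy loss: two quadratic wells of equal depth (loss 0 at both minima).
K_SHARP = 25.0  # curvature of the sharp minimum at w = -1 (assumed value)
K_FLAT = 1.0    # curvature of the flat minimum at w = +1 (assumed value)

# Point between the minima where the two quadratics cross (top of the barrier).
BARRIER = (math.sqrt(K_FLAT) - math.sqrt(K_SHARP)) / (math.sqrt(K_FLAT) + math.sqrt(K_SHARP))


def loss_grad(w):
    """Gradient of V(w) = min(K_SHARP/2 * (w + 1)^2, K_FLAT/2 * (w - 1)^2)."""
    v_sharp = 0.5 * K_SHARP * (w + 1.0) ** 2
    v_flat = 0.5 * K_FLAT * (w - 1.0) ** 2
    if v_sharp < v_flat:
        return K_SHARP * (w + 1.0)
    return K_FLAT * (w - 1.0)


def flat_fraction(steps=400_000, dt=0.01, temperature=0.5, seed=0):
    """Fraction of Langevin steps spent on the flat side of the barrier."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * temperature * dt)
    w = -1.0  # start at the bottom of the sharp minimum
    in_flat = 0
    for _ in range(steps):
        # Euler-Maruyama step: gradient drift plus Gaussian noise.
        w += -loss_grad(w) * dt + noise_scale * rng.gauss(0.0, 1.0)
        if w > BARRIER:
            in_flat += 1
    return in_flat / steps


if __name__ == "__main__":
    print(f"fraction of time in the flat basin: {flat_fraction():.3f}")
```

Although the walker starts inside the sharp minimum and both minima have the same loss, it ends up mostly in the flat one. This is only the Langevin side of the analogy; whether SGD noise on a real network behaves the same way is the conjecture the paper argues for.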

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques

Methods: Stochastic Gradient Descent