Theory of Deep Learning IIb: Optimization Properties of SGD
Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, Tomaso Poggio

TL;DR
This paper investigates how stochastic gradient descent (SGD) tends to find flat, wide minima in deep convolutional networks, combining theoretical analysis and experiments to support the conjecture that SGD selects flat minimizers, which are with high probability also global minimizers.
Contribution
It provides new theoretical and experimental evidence that SGD preferentially converges to flat minima, which are likely to be global minimizers in deep learning.
Findings
SGD concentrates on flat minima similar to Langevin dynamics
Flat minima are with high probability global minimizers
Experimental results support the conjecture about SGD's behavior
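The first finding can be illustrated with a small simulation. The sketch below is not taken from the paper: it assumes a hypothetical one-dimensional loss with two equally deep quadratic wells (curvatures `A_SHARP` and `A_FLAT`, both made-up values) and runs discretized Langevin dynamics, not SGD on a network. All particles start in the sharp well, and the code compares the fraction that ends in the flat basin with the fraction predicted by the Boltzmann density exp(-L(w)/T).

```python
import numpy as np

# Hypothetical toy landscape (an assumption, not from the paper):
# a sharp well at w = -1 and a flat well at w = +1, both with minimum loss 0.
A_SHARP, A_FLAT = 25.0, 1.0
W_BARRIER = -2.0 / 3.0  # point between the wells where the two parabolas meet


def loss(w):
    """Piecewise-quadratic loss: sharp well left of the barrier, flat well right."""
    return np.where(w < W_BARRIER,
                    0.5 * A_SHARP * (w + 1.0) ** 2,
                    0.5 * A_FLAT * (w - 1.0) ** 2)


def grad(w):
    """Gradient of the piecewise-quadratic loss."""
    return np.where(w < W_BARRIER, A_SHARP * (w + 1.0), A_FLAT * (w - 1.0))


def simulate(n_particles=2000, n_steps=5000, step=0.01, temperature=0.5, seed=0):
    """Euler-Maruyama discretization of dw = -grad(w) dt + sqrt(2T) dB."""
    rng = np.random.default_rng(seed)
    w = np.full(n_particles, -1.0)  # every particle starts in the sharp minimum
    noise_scale = np.sqrt(2.0 * temperature * step)
    for _ in range(n_steps):
        w = w - step * grad(w) + noise_scale * rng.standard_normal(n_particles)
    return w


def boltzmann_flat_fraction(temperature=0.5):
    """Mass of exp(-loss/T) in the flat basin, by summation on a uniform grid."""
    grid = np.linspace(-4.0, 6.0, 200001)
    density = np.exp(-loss(grid) / temperature)
    return float(density[grid >= W_BARRIER].sum() / density.sum())


final_w = simulate()
frac_flat = float(np.mean(final_w >= W_BARRIER))
print(f"simulated fraction in flat basin: {frac_flat:.3f}")
print(f"Boltzmann fraction in flat basin: {boltzmann_flat_fraction():.3f}")
```

Because both wells have the same depth, the preference for the flat basin in this toy comes only from its larger volume. The finite step size introduces a small bias, so the simulated and Boltzmann fractions should agree only approximately.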
Abstract
In Theory IIb we characterize with a mix of theory and experiments the optimization of deep convolutional networks by Stochastic Gradient Descent. The main new result in this paper is theoretical and experimental evidence for the following conjecture about SGD: SGD concentrates in probability -- like the classical Langevin equation -- on large volume, "flat" minima, selecting flat minimizers which are with very high probability also global minimizers.
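For reference, the classical Langevin equation mentioned in the abstract is commonly written as below. The notation is an assumption of this note, not the paper's: L is the loss, T a temperature (noise level), B_t a standard Brownian motion, d the number of parameters, and H the Hessian of L. The last line is the standard Laplace approximation, which holds only for a nondegenerate minimum w^*; it shows why, among minima of equal loss, those with smaller curvature carry more probability mass.

```latex
% Overdamped (gradient) Langevin dynamics on a loss L(w) at temperature T
dw_t = -\nabla L(w_t)\,dt + \sqrt{2T}\,dB_t

% Its stationary (Boltzmann) density
p(w) \propto \exp\!\left(-\frac{L(w)}{T}\right)

% Laplace approximation of the mass near a nondegenerate minimum w^*
P(\text{basin of } w^*) \;\propto\;
  \exp\!\left(-\frac{L(w^*)}{T}\right)\,
  \frac{(2\pi T)^{d/2}}{\sqrt{\det H(w^*)}}
```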
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Adversarial Robustness in Machine Learning · Stochastic Gradient Optimization Techniques
Methods: Stochastic Gradient Descent
