On the diffusion approximation of nonconvex stochastic gradient descent

Wenqing Hu; Chris Junchi Li; Lei Li; Jian-Guo Liu

arXiv:1705.07562 · stat.ML · March 6, 2018 · 30 cites

TL;DR

This paper rigorously analyzes how stochastic gradient descent (SGD) in nonconvex optimization can be approximated by diffusion processes, revealing insights into its escape dynamics from local minima and saddle points, and the influence of batch size.

Contribution

It provides a rigorous diffusion approximation of SGD in nonconvex problems and explores how batch size affects escape from minima and saddle points.

Findings

01

In the small step size regime, a diffusion process approximates the SGD iteration in the weak sense.

02

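Starting from a local minimizer, SGD needs a number of iterations that grows exponentially in the inverse step size to escape; starting from a saddle point, the number of iterations grows almost linearly in the inverse step size.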

03

Small batch sizes help SGD escape unstable stationary points and sharp minimizers, which suggests increasing the batch size at a later stage so that SGD settles in flat minimizers for better generalization.
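The two escape time scales in finding 02 can be written schematically as below. This is a sketch under assumptions, not the paper's exact statement: \eta denotes the step size, N the number of iterations until escape, and C, C' are unspecified positive problem-dependent constants. Reading "almost linearly" as "linear up to a logarithmic factor" is an interpretation that should be checked against the paper.

```latex
% Schematic escape time scales (assumed notation; constants C, C' are not taken from this page)
\[
N_{\text{local min}} \;\sim\; \exp\!\left(\frac{C}{\eta}\right),
\qquad
N_{\text{saddle}} \;\sim\; \frac{C'}{\eta}\,\log\frac{1}{\eta}.
\]
```

For example, halving \eta squares the exponential factor for a local minimizer, while it only roughly doubles the saddle point estimate.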

Abstract

We study the Stochastic Gradient Descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. We prove rigorously that the diffusion process can approximate the SGD algorithm weakly using the weak form of master equation for probability evolution. In the small step size regime and the presence of omnidirectional noise, our weak approximating diffusion process suggests the following dynamics for the SGD iteration starting from a local minimizer (resp. saddle point): it escapes in a number of iterations exponentially (resp. almost linearly) dependent on the inverse stepsize. The results are obtained using the theory for random perturbations of dynamical systems (theory of large deviations for local minimizers and theory of exiting for unstable stationary points). In addition, we discuss the effects of batch size for the deep neural…
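The weak approximation described in the abstract can be illustrated numerically. The sketch below is not the paper's construction: it assumes a one-dimensional double-well objective F(x) = (x^2 - 1)^2 / 4, additive gradient noise of constant size sigma, and the commonly used time-rescaled diffusion dY = -F'(Y) ds + sqrt(eta) * sigma dW, where s = k * eta. The function names, the Rademacher noise, and all parameter values are hypothetical choices made for illustration. "Weak" here means that expectations of test functions (the mean and second moment below) should agree, not individual trajectories.

```python
import numpy as np


def grad_F(x):
    """Gradient of the assumed double-well objective F(x) = (x**2 - 1)**2 / 4."""
    return x**3 - x


def run_sgd(x0, eta, sigma, n_steps, n_paths, rng):
    """Run n_paths independent SGD chains with +/- sigma (Rademacher) gradient noise."""
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        xi = rng.choice([-1.0, 1.0], size=n_paths)  # zero mean, unit variance
        x = x - eta * (grad_F(x) + sigma * xi)
    return x


def run_diffusion(x0, eta, sigma, n_steps, n_paths, rng, substeps=10):
    """Euler-Maruyama for dY = -F'(Y) ds + sqrt(eta) * sigma dW up to time n_steps * eta.

    A finer time step (eta / substeps) is used so that the scheme is not
    simply a copy of the SGD recursion.
    """
    y = np.full(n_paths, x0, dtype=float)
    ds = eta / substeps
    for _ in range(n_steps * substeps):
        dw = rng.normal(0.0, np.sqrt(ds), size=n_paths)
        y = y - grad_F(y) * ds + np.sqrt(eta) * sigma * dw
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eta, sigma = 0.05, 1.0
    # Both processes start at the local minimizer x = 1.
    x = run_sgd(1.0, eta, sigma, n_steps=200, n_paths=20000, rng=rng)
    y = run_diffusion(1.0, eta, sigma, n_steps=200, n_paths=20000, rng=rng)
    print("SGD       mean, second moment:", x.mean(), (x**2).mean())
    print("diffusion mean, second moment:", y.mean(), (y**2).mean())
```

The SGD noise is deliberately non-Gaussian: with Gaussian noise and substeps=1 the Euler-Maruyama update would coincide with the SGD update, so the comparison would be trivial. With these parameters the barrier between the wells is rarely crossed, consistent with the exponentially long escape time from a local minimizer stated in the abstract.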

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent