Exponential escape efficiency of SGD from sharp minima in non-stationary regime
Hikaru Ibayashi, Masaaki Imaizumi

TL;DR
This paper develops a new theoretical framework using Large Deviation Theory to show that SGD escapes sharp minima exponentially fast even before reaching stationarity, explaining its effectiveness in training neural networks.
Contribution
It introduces a novel theory for SGD escape efficiency in non-stationary regimes, extending understanding beyond stationary distribution assumptions.
Findings
SGD escapes sharp minima exponentially fast in non-stationary regimes
The theory applies to both continuous and discrete SGD
Experimental results support the theoretical predictions
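The first finding can be illustrated with a toy simulation. The sketch below is not the paper's method or experimental setup. It assumes a one-dimensional quadratic loss, a fixed loss barrier that defines "escape", and Gaussian gradient noise whose standard deviation grows with the square root of the sharpness. The function name `mean_escape_steps` and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def mean_escape_steps(sharpness, barrier=0.2, lr=0.1, noise=1.0,
                      trials=200, max_steps=50_000, seed=0):
    """Average number of noisy gradient steps until the loss reaches `barrier`.

    Toy model (an assumption, not the paper's setting):
      loss L(x) = 0.5 * sharpness * x**2, started at the minimum x = 0,
      gradient noise ~ N(0, noise**2 * sharpness).
    """
    rng = random.Random(seed)
    noise_std = noise * math.sqrt(sharpness)  # assumed sharpness-dependent noise
    total = 0
    for _ in range(trials):
        x = 0.0
        for step in range(1, max_steps + 1):
            # Noisy gradient step: true gradient plus Gaussian perturbation.
            grad = sharpness * x + noise_std * rng.gauss(0.0, 1.0)
            x -= lr * grad
            # "Escape" means the loss has climbed to the barrier height.
            if 0.5 * sharpness * x * x >= barrier:
                break
        # Runs that never escape are counted as max_steps (censored).
        total += step
    return total / trials


flat = mean_escape_steps(sharpness=1.0)
sharp = mean_escape_steps(sharpness=4.0)
print(f"flat minimum: {flat:.1f} steps, sharp minimum: {sharp:.1f} steps")
```

Under these assumptions, the run started in the sharper well crosses the same loss barrier in far fewer steps on average. Every trajectory starts exactly at the minimum, so the measurement does not rely on the process first settling into a stationary distribution.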
Abstract
We show that stochastic gradient descent (SGD) escapes from sharp minima exponentially fast even before it reaches a stationary distribution. SGD has been the de facto standard training algorithm for various machine learning tasks. However, it remains an open question why SGD finds highly generalizable parameters for non-convex target functions, such as the loss functions of neural networks. "Escape efficiency" is an attractive notion for tackling this question: it measures how efficiently SGD escapes from sharp minima, which potentially have low generalization performance. Despite its importance, the notion has the limitation that it works only when SGD reaches a stationary distribution after sufficient updates. In this paper, we develop a new theory to investigate the escape efficiency of SGD with Gaussian noise, by introducing the Large Deviation Theory for dynamical…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Neural Networks and Applications · Model Reduction and Neural Networks
Methods: Stochastic Gradient Descent
