Understanding Short-Range Memory Effects in Deep Neural Networks
Chengli Tan, Jiangshe Zhang, and Junmin Liu

TL;DR
This paper proposes that the stochastic gradient noise (SGN) in SGD is better modeled by fractional Brownian motion (FBM) than by Brownian or Lévy stable motion, which helps explain SGD's tendency to favor flat minima and offers insight into its convergence and generalization behavior.
Contribution
The study models SGD as a discretization of a stochastic differential equation (SDE) driven by fractional Brownian motion (FBM), motivated by the short-range correlation observed in the stochastic gradient noise series.
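As a rough illustration of this viewpoint, the sketch below simulates an Euler-type discretization of a one-dimensional SDE driven by FBM. It is not the authors' code: the quadratic loss, the Hurst parameter value, the noise scale `sigma`, and the helper names (`fgn_autocov`, `sample_fgn`, `euler_fbm_sde`) are all assumptions made for illustration. The FBM increments (fractional Gaussian noise) are sampled exactly through a Cholesky factor of their covariance, which is practical only for short paths.

```python
import numpy as np


def fgn_autocov(n, hurst):
    """Autocovariance of unit-step fractional Gaussian noise at lags 0..n-1."""
    k = np.arange(n, dtype=float)
    h2 = 2.0 * hurst
    return 0.5 * (np.abs(k + 1) ** h2 - 2.0 * np.abs(k) ** h2 + np.abs(k - 1) ** h2)


def sample_fgn(n, hurst, rng):
    """Draw n correlated FBM increments using a Cholesky factor (O(n^3))."""
    gamma = fgn_autocov(n, hurst)
    idx = np.arange(n)
    # Toeplitz covariance matrix: entry (i, j) depends only on |i - j|.
    cov = gamma[np.abs(idx[:, None] - idx[None, :])]
    chol = np.linalg.cholesky(cov)
    return chol @ rng.standard_normal(n)


def euler_fbm_sde(grad, x0, step, sigma, hurst, n_steps, rng):
    """Iterate x_{k+1} = x_k - step * grad(x_k) + sigma * (FBM increment)."""
    # By self-similarity, an FBM increment over a time step of length
    # `step` has standard deviation step**hurst.
    noise = sample_fgn(n_steps, hurst, rng) * step ** hurst
    x = np.empty(n_steps + 1)
    x[0] = x0
    for k in range(n_steps):
        x[k + 1] = x[k] - step * grad(x[k]) + sigma * noise[k]
    return x


# Hypothetical setup: quadratic loss f(x) = x^2 / 2, so grad f(x) = x.
rng = np.random.default_rng(0)
path = euler_fbm_sde(grad=lambda x: x, x0=1.0, step=0.01,
                     sigma=0.1, hurst=0.3, n_steps=500, rng=rng)
```

With `hurst = 0.5` the increments are uncorrelated and the scheme reduces to the usual Euler-Maruyama discretization driven by Brownian motion; other values introduce correlation between successive noise terms.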
Findings
Stochastic gradient noise (SGN) is neither Gaussian nor Lévy stable.
SGD favors flat minima, where it has longer residence times.
Short-range memory effects are consistent across models and datasets.
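One simple way to look for memory in a noise series is its sample autocorrelation at small lags. The sketch below is a generic estimator, not the diagnostic used in the paper; the function name `sample_autocorr` and the synthetic white-noise input are assumptions for illustration. In practice the input would be a recorded series of one SGN coordinate.

```python
import numpy as np


def sample_autocorr(series, max_lag):
    """Sample autocorrelation of a 1-D series at lags 0..max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    n = len(x)
    # Biased estimator: every lag is normalized by the lag-0 sum of squares.
    return np.array([np.dot(x[: n - k], x[k:]) / denom
                     for k in range(max_lag + 1)])


# Synthetic stand-in for a noise series: memoryless Gaussian samples.
rng = np.random.default_rng(0)
white = rng.standard_normal(10_000)
acf_white = sample_autocorr(white, max_lag=5)
```

For a memoryless series the values at nonzero lags should stay close to zero; values that are clearly nonzero at small lags and decay quickly would be consistent with short-range correlation.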
Abstract
Stochastic gradient descent (SGD) is of fundamental importance in deep learning. Despite its simplicity, elucidating its efficacy remains challenging. Conventionally, the success of SGD is ascribed to the stochastic gradient noise (SGN) incurred in the training process. Based on this consensus, SGD is frequently treated and analyzed as the Euler-Maruyama discretization of stochastic differential equations (SDEs) driven by either Brownian or Lévy stable motion. In this study, we argue that SGN is neither Gaussian nor Lévy stable. Instead, inspired by the short-range correlation emerging in the SGN series, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM). Accordingly, the different convergence behavior of SGD dynamics is well-grounded. Moreover, the first passage time of an SDE driven by FBM is approximately derived. The result…
Taxonomy
Topics: Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference
Methods: Stochastic Gradient Descent
