Understanding Short-Range Memory Effects in Deep Neural Networks

Chengli Tan, Jiangshe Zhang, and Junmin Liu

arXiv:2105.02062·cs.LG·February 21, 2023·1 cites

TL;DR

This paper proposes that SGD is better modeled as an SDE driven by fractional Brownian motion than by Brownian or Lévy stable motion. On this view, the short-range correlation in the stochastic gradient noise explains why SGD favors flat minima and offers new insight into its convergence and generalization properties.

Contribution

The study introduces a novel perspective that models SGD as a discretization of an SDE driven by fractional Brownian motion, highlighting the role of short-range memory effects.
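As an illustration of what "a discretization of an SDE driven by fractional Brownian motion" can look like, here is a minimal sketch, not the authors' implementation. It assumes a toy quadratic loss f(x) = x^2 / 2, samples the FBM increments (fractional Gaussian noise) by Cholesky factorization of their autocovariance, and uses hypothetical names and values (`fgn_autocov`, `simulate_sde_fbm`, `hurst`, `lr`, `sigma`) that do not come from the paper. The Hurst parameter controls the memory of the noise; `hurst = 0.5` recovers ordinary Brownian increments.

```python
import numpy as np


def fgn_autocov(n, hurst):
    """Autocovariance of unit-variance FBM increments at lags 0..n-1."""
    k = np.arange(n, dtype=float)
    h2 = 2.0 * hurst
    return 0.5 * (np.abs(k + 1) ** h2 - 2.0 * np.abs(k) ** h2 + np.abs(k - 1) ** h2)


def sample_fgn(n, hurst, rng):
    """Draw n correlated increments via Cholesky factorization (O(n^3), fine for small n)."""
    gamma = fgn_autocov(n, hurst)
    idx = np.arange(n)
    cov = gamma[np.abs(idx[:, None] - idx[None, :])]
    # Tiny jitter keeps the factorization numerically stable.
    chol = np.linalg.cholesky(cov + 1e-12 * np.eye(n))
    return chol @ rng.standard_normal(n)


def simulate_sde_fbm(n_steps, hurst, lr=0.01, sigma=0.1, x0=1.0, seed=0):
    """Euler-type scheme for dx = -f'(x) dt + sigma dB^H with the toy loss f(x) = x^2 / 2."""
    rng = np.random.default_rng(seed)
    noise = sample_fgn(n_steps, hurst, rng)
    x = np.empty(n_steps + 1)
    x[0] = x0
    for t in range(n_steps):
        grad = x[t]  # gradient of the assumed quadratic loss
        # An FBM increment over a time step of length lr has std lr ** hurst.
        x[t + 1] = x[t] - lr * grad + sigma * lr ** hurst * noise[t]
    return x


if __name__ == "__main__":
    path = simulate_sde_fbm(n_steps=500, hurst=0.7)
    print(path[:5])
```

Comparing runs with different `hurst` values on the same toy loss gives a rough feel for how correlated noise changes the trajectory; any real experiment would replace the quadratic gradient with minibatch gradients from an actual model.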

Findings

01. SGN is neither Gaussian nor Lévy stable.

02. SGD favors flat minima, where it has longer residence times.

03. Short-range memory effects are consistent across models and datasets.
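Memory in a noise series is commonly summarized by the Hurst exponent of the fractional Brownian motion it is compared against. The summary above does not say how the authors measure it, so the following is only a sketch of one standard estimator, the aggregated-variance method, with a hypothetical function name and block sizes. It relies on the fact that, for FBM increments, the variance of block means of size m scales as m^(2H - 2).

```python
import numpy as np


def hurst_aggregated_variance(series, block_sizes=(4, 8, 16, 32, 64)):
    """Estimate the Hurst exponent H from the scaling of block-mean variances."""
    series = np.asarray(series, dtype=float)
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        # Average the series over non-overlapping blocks of length m.
        block_means = series[: n_blocks * m].reshape(n_blocks, m).mean(axis=1)
        log_m.append(np.log(m))
        log_var.append(np.log(block_means.var()))
    # Var(block mean) ~ m ** (2H - 2), so the log-log slope equals 2H - 2.
    slope, _ = np.polyfit(log_m, log_var, 1)
    return 1.0 + slope / 2.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    white_noise = rng.standard_normal(2 ** 14)
    # Uncorrelated noise should give an estimate close to 0.5.
    print(hurst_aggregated_variance(white_noise))
```

Applied to a recorded series of one gradient-noise coordinate, an estimate that departs clearly from 0.5 would point to correlated increments. The estimator is noisy on short series, so it is best treated as a rough diagnostic.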

Abstract

Stochastic gradient descent (SGD) is of fundamental importance in deep learning. Despite its simplicity, elucidating its efficacy remains challenging. Conventionally, the success of SGD is ascribed to the stochastic gradient noise (SGN) incurred in the training process. Based on this consensus, SGD is frequently treated and analyzed as the Euler-Maruyama discretization of stochastic differential equations (SDEs) driven by either Brownian or Lévy stable motion. In this study, we argue that SGN is neither Gaussian nor Lévy stable. Instead, inspired by the short-range correlation emerging in the SGN series, we propose that SGD can be viewed as a discretization of an SDE driven by fractional Brownian motion (FBM). Accordingly, the different convergence behavior of SGD dynamics is well-grounded. Moreover, the first passage time of an SDE driven by FBM is approximately derived. The result…

Taxonomy

Topics: Model Reduction and Neural Networks · Stochastic Gradient Optimization Techniques · Gaussian Processes and Bayesian Inference

Methods: Stochastic Gradient Descent