On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
Umut Şimşekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, Levent Sagun

TL;DR
This paper challenges the Gaussian noise assumption in SGD analysis for deep learning, proposing a heavy-tailed α-stable noise model driven by Lévy motion, which the authors argue better explains SGD's convergence behavior and its transitions between minima.
Contribution
It introduces a heavy-tailed α-stable noise model for SGD, links the tail index to convergence rates, and provides experimental validation across various deep learning settings.
Findings
SGD noise is highly non-Gaussian with heavy tails
Heavy-tailed noise influences transition from narrow to wide minima
Convergence rate is explicitly connected to the tail index
Abstract
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT, which suggests that the GN converges to a heavy-tailed α-stable random vector, where tail-index α determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization…
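To make the role of the tail index concrete, here is a minimal sketch that draws synthetic symmetric α-stable samples and compares how often they exceed a large threshold. It is not code from the paper: the sampler uses the standard Chambers-Mallows-Stuck construction, the function names `sample_sas` and `tail_fraction` are hypothetical, and the values α = 1.5 and threshold 10 are arbitrary illustrative choices. The noise is simulated, not measured from a neural network. With unit scale, α = 2 reduces to a Gaussian with variance 2.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw unit-scale symmetric alpha-stable samples, 0 < alpha <= 2.

    Uses the Chambers-Mallows-Stuck construction (an assumption of this
    sketch, not something specified in the summary above).
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-mean exponential
    if alpha == 1.0:
        return np.tan(u)                          # Cauchy special case
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))


def tail_fraction(x, threshold):
    """Fraction of samples whose magnitude exceeds the threshold."""
    return float(np.mean(np.abs(x) > threshold))


rng = np.random.default_rng(0)
n = 200_000
gauss = sample_sas(2.0, n, rng)   # alpha = 2: Gaussian, variance 2
heavy = sample_sas(1.5, n, rng)   # alpha < 2: heavy-tailed

print("alpha=2.0 tail fraction:", tail_fraction(gauss, 10.0))
print("alpha=1.5 tail fraction:", tail_fraction(heavy, 10.0))
```

In the Gaussian case, excursions beyond the threshold are essentially absent, while for α = 1.5 a visible fraction of samples lands there. This is the sense in which a smaller tail index means heavier tails.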
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Adversarial Robustness in Machine Learning
Methods: Stochastic Gradient Descent
