A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks
Umut Simsekli, Levent Sagun, Mert Gurbuzbalaban

TL;DR
This paper challenges the Gaussian assumption of gradient noise in SGD for deep learning, proposing a heavy-tailed alpha-stable model driven by Lévy motion, supported by extensive empirical evidence across architectures and datasets.
Contribution
It introduces a non-Gaussian heavy-tailed model for gradient noise in SGD, replacing the classical Brownian motion framework with Lévy motion analysis.
Findings
Gradient noise is highly non-Gaussian and heavy-tailed across the architectures and datasets examined.
Under the Lévy-driven SDE model, jumps can move SGD from narrow minima to wider ones.
Heavy-tailed behavior varies with architecture, size, and dataset.
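To make the notion of a tail index concrete, here is a minimal Python sketch, not the authors' code. It draws symmetric alpha-stable samples with the Chambers-Mallows-Stuck method and recovers alpha with a block-sum log-moment estimator, which relies on the stability property that a sum of K1 i.i.d. symmetric alpha-stable variables is distributed as K1^(1/alpha) times a single one. The function names `sample_sas` and `estimate_tail_index`, the block size of 500, and the synthetic data are assumptions made for illustration; in practice the input would be the flattened stochastic gradient noise.

```python
import numpy as np


def sample_sas(alpha, size, rng):
    """Draw symmetric alpha-stable samples (unit scale) via Chambers-Mallows-Stuck."""
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-rate exponential
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)) * (
        np.cos((1.0 - alpha) * u) / w
    ) ** ((1.0 - alpha) / alpha)


def estimate_tail_index(x, k1):
    """Estimate alpha from block sums (hypothetical helper).

    Splits x into k2 blocks of length k1 and sums each block. For i.i.d.
    symmetric alpha-stable data, E[log|block sum|] - E[log|x|] equals
    log(k1) / alpha, which is solved for alpha.
    """
    x = np.asarray(x, dtype=float).ravel()
    k2 = x.size // k1
    x = x[: k1 * k2]                      # drop the remainder
    y = x.reshape(k2, k1).sum(axis=1)     # block sums
    inv_alpha = (np.mean(np.log(np.abs(y))) - np.mean(np.log(np.abs(x)))) / np.log(k1)
    return 1.0 / inv_alpha


rng = np.random.default_rng(0)
heavy = sample_sas(1.5, 250_000, rng)     # heavy-tailed noise, alpha = 1.5
gauss = rng.normal(size=250_000)          # Gaussian noise, i.e. alpha = 2

alpha_heavy = estimate_tail_index(heavy, k1=500)
alpha_gauss = estimate_tail_index(gauss, k1=500)
print(f"heavy-tailed sample: alpha ~ {alpha_heavy:.2f}")
print(f"Gaussian sample:     alpha ~ {alpha_gauss:.2f}")
```

An estimate close to 2 is consistent with Gaussian noise, while values clearly below 2 indicate heavy tails. The block size trades bias against variance, so it is worth checking that the estimate is stable across a few choices of `k1`.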
Abstract
The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the classical central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the generalized CLT (GCLT), which suggests that the GN converges to a heavy-tailed α-stable random variable. Accordingly, we propose to analyze SGD as an SDE driven by a Lévy motion. Such SDEs can incur 'jumps', which force the SDE to transition from narrow minima…
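For readers unfamiliar with the distribution named in the abstract, the following is a sketch in standard textbook notation, not a quotation from the paper. The symbols $\omega$ (frequency), $\sigma$ (scale), $f$ (loss), $W_t$ (iterate), $\varepsilon$ (noise scale), and $L_t^{\alpha}$ (α-stable Lévy motion) are assumptions introduced here.

```latex
% Symmetric alpha-stable (SaS) law, defined through its characteristic function:
\[
  X \sim \mathcal{S}\alpha\mathcal{S}(\sigma)
  \quad\Longleftrightarrow\quad
  \mathbb{E}\!\left[e^{i\omega X}\right] = \exp\!\left(-\lvert \sigma\,\omega \rvert^{\alpha}\right),
  \qquad \alpha \in (0, 2].
\]
% Generic form of an SDE driven by alpha-stable Levy motion (noise scale left abstract):
\[
  \mathrm{d}W_t = -\nabla f(W_t)\,\mathrm{d}t + \varepsilon\,\mathrm{d}L_t^{\alpha}.
\]
```

For α = 2 the law is Gaussian and the driving process reduces to a scaled Brownian motion, which recovers the classical analysis. For α < 2 the variance is infinite and the density decays like $1/\lvert x \rvert^{\alpha+1}$, so the driving process has discontinuous paths; these discontinuities are the jumps the abstract refers to.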
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Adversarial Robustness in Machine Learning · Gaussian Processes and Bayesian Inference
Methods: Stochastic Gradient Descent
