On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

Umut Şimşekli; Mert Gürbüzbalaban; Thanh Huy Nguyen; Gaël Richard; Levent Sagun

arXiv:1912.00018·stat.ML·December 3, 2019·22 cites

Open Access

TL;DR

This paper challenges the Gaussian noise assumption in SGD analysis for deep learning. It proposes a heavy-tailed α-stable noise model driven by Lévy motion, which better explains SGD's convergence behavior and its transitions between minima.

Contribution

It introduces a heavy-tailed α-stable noise model for SGD, links the tail index α to convergence rates, and provides experimental validation across various deep learning settings.

Findings

01 SGD noise is highly non-Gaussian, with heavy tails

02 Heavy-tailed noise influences the transition from narrow to wide minima

03 The convergence rate is explicitly connected to the tail index
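Since the findings hinge on the tail index, it may help to see how such an index could be estimated from samples. The sketch below is a generic block-sum estimator built on the stability property (a sum of K i.i.d. symmetric α-stable variables is distributed like K^(1/α) times a single one); it is not claimed to be the estimator used in the paper. The function names, the block size, and the two test distributions (Gaussian, α = 2, and Cauchy, α = 1) are illustrative assumptions.

```python
import math
import random


def estimate_tail_index(samples, block_size):
    """Estimate alpha from centered, symmetric, i.i.d. samples (assumption).

    Uses 1/alpha = (E log|Y| - E log|X|) / log(block_size), where Y is the
    sum of `block_size` consecutive samples.
    """
    num_blocks = len(samples) // block_size
    n = num_blocks * block_size
    xs = samples[:n]  # drop the remainder that does not fill a block
    mean_log_x = sum(math.log(abs(x)) for x in xs) / n
    block_sums = [
        sum(xs[i * block_size:(i + 1) * block_size]) for i in range(num_blocks)
    ]
    mean_log_y = sum(math.log(abs(y)) for y in block_sums) / num_blocks
    inv_alpha = (mean_log_y - mean_log_x) / math.log(block_size)
    return 1.0 / inv_alpha


def gaussian_samples(n, seed=0):
    """Standard normal draws: the alpha = 2 member of the stable family."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def cauchy_samples(n, seed=0):
    """Standard Cauchy draws: the alpha = 1 member of the stable family."""
    rng = random.Random(seed)
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


if __name__ == "__main__":
    print("Gaussian:", estimate_tail_index(gaussian_samples(40000), 100))
    print("Cauchy:  ", estimate_tail_index(cauchy_samples(40000), 100))
```

On synthetic data the Gaussian estimate should land near 2 and the Cauchy estimate near 1. Applying the same idea to real gradient noise would additionally require centering the stochastic gradients, which this sketch does not do.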

Abstract

The gradient noise (GN) in the stochastic gradient descent (SGD) algorithm is often considered to be Gaussian in the large data regime by assuming that the \emph{classical} central limit theorem (CLT) kicks in. This assumption is often made for mathematical convenience, since it enables SGD to be analyzed as a stochastic differential equation (SDE) driven by a Brownian motion. We argue that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate. Inspired by non-Gaussian natural phenomena, we consider the GN in a more general context and invoke the \emph{generalized} CLT, which suggests that the GN converges to a \emph{heavy-tailed} $\alpha$-stable random vector, where \emph{tail-index} $\alpha$ determines the heavy-tailedness of the distribution. Accordingly, we propose to analyze SGD as a discretization…
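To make the role of the tail index concrete, the following minimal sketch draws standard symmetric α-stable variates with the Chambers-Mallows-Stuck method and measures how often they exceed a fixed threshold. It is an illustration only, not code from the paper: the function names, the threshold of 10, and the sample size are assumptions chosen for the example. With α = 2 the sampler reduces to a Gaussian (of variance 2), and with α = 1 to a Cauchy.

```python
import math
import random


def sample_sas(alpha, rng):
    """Draw one standard symmetric alpha-stable variate, 0 < alpha <= 2."""
    u = math.pi * (rng.random() - 0.5)  # uniform angle in [-pi/2, pi/2)
    w = rng.expovariate(1.0)            # unit-rate exponential
    if alpha == 1.0:
        return math.tan(u)              # Cauchy special case
    return (
        math.sin(alpha * u) / math.cos(u) ** (1.0 / alpha)
        * (math.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha)
    )


def tail_fraction(alpha, threshold, n, seed=0):
    """Fraction of n draws whose magnitude exceeds `threshold`."""
    rng = random.Random(seed)
    hits = sum(abs(sample_sas(alpha, rng)) > threshold for _ in range(n))
    return hits / n


if __name__ == "__main__":
    # Smaller alpha means heavier tails, so more draws exceed the threshold.
    for alpha in (2.0, 1.5, 1.0):
        print(alpha, tail_fraction(alpha, threshold=10.0, n=20000))
```

Large excursions are essentially absent in the Gaussian case and become steadily more frequent as α decreases, which is the sense in which α "determines the heavy-tailedness of the distribution".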

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Adversarial Robustness in Machine Learning

Methods: Stochastic Gradient Descent