The Heavy-Tail Phenomenon in SGD

Mert Gurbuzbalaban; Umut Şimşekli; Lingjiong Zhu

arXiv:2006.04740 · math.OC · June 15, 2021 · 38 cites

TL;DR

This paper links the heaviness of the tails in SGD's weight distribution to the Hessian structure and algorithm parameters, providing a unified view of generalization properties in deep learning.

Contribution

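It establishes a theoretical connection between the flatness of the minimum, the ratio of the stepsize $η$ to the batch-size $b$, and the tail-index, showing that, depending on the Hessian structure and the choices of $η$ and $b$, SGD iterates can converge to a heavy-tailed distribution.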
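
As a rough illustration of how these three quantities can interact, consider the linear regression setting listed under Methods. The notation below ($a_i$ for inputs, $y_i$ for targets, $\Omega_k$ for the mini-batch at step $k$) is assumed here for illustration and is not taken from this page. With the loss $\tfrac{1}{2}(a_i^{\top}x - y_i)^2$, mini-batch SGD becomes a random linear recursion, and in one dimension a Kesten-Goldie type result gives the tail-index as the root of a moment equation, under regularity conditions not spelled out here.

```latex
% Sketch only: notation assumed for illustration.
% a_i: inputs, y_i: targets, \Omega_k: mini-batch at step k, |\Omega_k| = b.
\begin{align*}
x_k &= M_k\, x_{k-1} + q_k, \\
M_k &= I - \frac{\eta}{b} H_k, \qquad
H_k = \sum_{i \in \Omega_k} a_i a_i^{\top}, \qquad
q_k = \frac{\eta}{b} \sum_{i \in \Omega_k} a_i y_i .
\end{align*}
% One-dimensional case: if E[log|M_k|] < 0 and some \alpha > 0 solves
\[
\mathbb{E}\,\Bigl| 1 - \frac{\eta}{b} H_k \Bigr|^{\alpha} = 1,
\]
% then, under additional regularity conditions, the stationary law satisfies
\[
\mathbb{P}\bigl(|x_\infty| > t\bigr) \sim c\, t^{-\alpha} \qquad (t \to \infty).
\]
```

In this form the multiplier $M_k$ couples the stepsize, the batch-size, and the curvature term $H_k$, and the stationary distribution has infinite variance whenever $\alpha < 2$.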

Findings

01. SGD iterates can have heavy tails with infinite variance.

02. The tail behavior depends on Hessian structure and algorithm parameters.

03. Experimental results support the theoretical insights.

Abstract

In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize $η$ to the batch-size $b$, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters $η$ and $b$, the SGD…
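
The link between the stepsize and the tail-index can be probed numerically. The following is a minimal sketch, not the authors' code: it assumes a one-dimensional least-squares problem with standard Gaussian inputs and Gaussian label noise, runs many independent SGD chains, and estimates the tail-index of the final errors with the Hill estimator. The function names `simulate_sgd_errors` and `hill_tail_index`, and all parameter values, are hypothetical choices made for illustration.

```python
import numpy as np


def simulate_sgd_errors(eta, b, n_chains=20000, n_iters=500, noise_std=1.0, seed=0):
    """Run independent 1-D least-squares SGD chains; return final errors x_k - x_star.

    Assumed model (illustration only): y = a * x_star + eps, with a ~ N(0, 1)
    and eps ~ N(0, noise_std^2), loss 0.5 * (a * x - y)^2 per sample.
    """
    rng = np.random.default_rng(seed)
    err = np.zeros(n_chains)
    for _ in range(n_iters):
        a = rng.standard_normal((n_chains, b))
        eps = noise_std * rng.standard_normal((n_chains, b))
        curvature = (a * a).sum(axis=1)    # mini-batch Hessian (scalar in 1-D)
        grad_noise = (a * eps).sum(axis=1)  # additive noise term
        # err_k = (1 - (eta / b) * H_k) * err_{k-1} + (eta / b) * sum(a_i * eps_i)
        err = (1.0 - (eta / b) * curvature) * err + (eta / b) * grad_noise
    return err


def hill_tail_index(samples, k=500):
    """Hill estimator of the tail-index from the k largest magnitudes."""
    mags = np.sort(np.abs(samples))[::-1]  # descending order
    top, threshold = mags[:k], mags[k]
    return k / np.log(top / threshold).sum()


alpha_small_step = hill_tail_index(simulate_sgd_errors(eta=0.2, b=1))
alpha_large_step = hill_tail_index(simulate_sgd_errors(eta=0.9, b=1))
print(f"estimated tail-index, eta=0.2: {alpha_small_step:.2f}")
print(f"estimated tail-index, eta=0.9: {alpha_large_step:.2f}")
```

Under these assumptions with $b = 1$, the second moment of the multiplier is $1 - 2η + 3η^2$, which exceeds 1 once $η > 2/3$, so the larger stepsize sits in the infinite-variance regime and should produce the smaller estimated tail-index. The Hill estimate is sensitive to the choice of `k` and is only a rough diagnostic.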

Code & Models

Repositories

umutsimsekli/sgd_ht (PyTorch, official)

Videos

The Heavy-Tail Phenomenon in SGD · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM

Methods: Linear Regression · Stochastic Gradient Descent