The Heavy-Tail Phenomenon in SGD
Mert Gurbuzbalaban, Umut Şimşekli, Lingjiong Zhu

TL;DR
This paper links the heaviness of the tails in SGD's weight distribution to the Hessian structure and algorithm parameters, providing a unified view of generalization properties in deep learning.
Contribution
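It establishes a theoretical connection between the flatness of the minimum, the stepsize-to-batch-size ratio, and the tail-index, showing that SGD iterates converge to a heavy-tailed stationary distribution depending on the Hessian of the loss at the minimum and on the choice of stepsize and batch-size.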
Findings
SGD iterates can have heavy tails with infinite variance.
The tail behavior depends on the structure of the Hessian at the minimum and on the algorithm parameters (stepsize and batch-size).
Experimental results support the theoretical insights.
Abstract
In recent years, various notions of capacity and complexity have been proposed for characterizing the generalization properties of stochastic gradient descent (SGD) in deep learning. Some of the popular notions that correlate well with the performance on unseen data are (i) the 'flatness' of the local minimum found by SGD, which is related to the eigenvalues of the Hessian, (ii) the ratio of the stepsize η to the batch-size b, which essentially controls the magnitude of the stochastic gradient noise, and (iii) the 'tail-index', which measures the heaviness of the tails of the network weights at convergence. In this paper, we argue that these three seemingly unrelated perspectives for generalization are deeply linked to each other. We claim that depending on the structure of the Hessian of the loss at the minimum, and the choices of the algorithm parameters η and b, the SGD…
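The link between stepsize, batch-size, and tail heaviness can be illustrated on one-dimensional least squares, where each SGD step has the form x_{k+1} = M_k x_k + q_k with a random multiplier M_k = 1 - (η/b) Σ a_i². The sketch below is an illustration under stated assumptions, not the authors' code: the Gaussian data model, all function names, and the parameter values are hypothetical, and the Hill estimator is swapped in as a generic tail-index estimator that the paper does not necessarily use. Under the Gaussian assumption, E[M²] has a closed form, and E[M²] ≥ 1 is the usual sign that the stationary variance is infinite.

```python
import numpy as np


def second_moment_of_multiplier(eta, b, sigma):
    """E[M^2] for M = 1 - (eta/b) * sum_{i=1}^b a_i^2, with a_i ~ N(0, sigma^2).

    Uses sum a_i^2 = sigma^2 * chi2_b, whose mean is b and whose
    second moment is b * (b + 2).
    """
    return 1.0 - 2.0 * eta * sigma**2 + (eta * sigma**2) ** 2 * (b + 2) / b


def run_sgd(eta, b, sigma, n_iters, n_chains, seed=0):
    """Final iterates of independent SGD chains on 1-D least squares.

    Assumed data model (illustrative): a_i ~ N(0, sigma^2), y_i ~ N(0, 1),
    loss per sample 0.5 * (a_i * x - y_i)^2, fresh mini-batch every step.
    """
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)
    for _ in range(n_iters):
        a = rng.normal(0.0, sigma, size=(n_chains, b))
        y = rng.normal(0.0, 1.0, size=(n_chains, b))
        # Mini-batch gradient of the squared loss, one value per chain.
        grad = np.mean(a * (a * x[:, None] - y), axis=1)
        x = x - eta * grad
    return x


def hill_estimator(samples, k):
    """Hill estimate of the tail index from the k largest absolute values."""
    z = np.sort(np.abs(samples))[::-1]  # descending order
    return 1.0 / np.mean(np.log(z[:k] / z[k]))


if __name__ == "__main__":
    for eta, b in [(0.1, 1), (1.0, 1), (1.0, 8)]:
        x = run_sgd(eta, b, sigma=1.0, n_iters=300, n_chains=5000)
        print(
            f"eta={eta}, b={b}: "
            f"E[M^2]={second_moment_of_multiplier(eta, b, 1.0):.3f}, "
            f"Hill estimate={hill_estimator(x, k=200):.2f}"
        )
```

In this toy setting, raising η or lowering b pushes E[M²] above 1, and the estimated tail index drops accordingly. The Hill estimate is sensitive to the choice of k, so it is best read as a qualitative indicator rather than a precise value.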
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Machine Learning and ELM
Methods: Linear Regression · Stochastic Gradient Descent
