Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control
Xingyu Wang, Chang-Han Rhee

TL;DR
This paper provides a theoretical and empirical analysis of how heavy-tailed stochastic gradient descent (SGD) dynamics help avoid sharp minima and improve generalization in deep learning, using large deviations and metastability analysis.
Contribution
It introduces a framework for the global dynamics of heavy-tailed SGD, showing that injecting and then truncating heavy-tailed noise helps SGD avoid sharp minima and improves generalization.
Findings
Heavy-tailed SGD avoids sharp minima more effectively.
Gradient clipping leads to flatter minima and better test performance.
Theoretical predictions are confirmed by simulations and deep learning experiments.
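To make the idea of injecting and then truncating heavy-tailed noise concrete, here is a minimal one-dimensional sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (clip, heavy_tailed_noise, truncated_sgd), the symmetrized Pareto noise model, and the choice to clip the whole update step at a threshold b are all hypothetical choices made for this example.

```python
import numpy as np


def clip(step, b):
    """Truncate a scalar step so its magnitude never exceeds b."""
    return float(np.clip(step, -b, b))


def heavy_tailed_noise(rng, alpha):
    """Draw symmetric noise with a power-law tail of index alpha.

    Assumption: a sign-symmetrized Lomax (Pareto II) variable stands in
    for heavy-tailed gradient noise; the paper may use a different model.
    """
    magnitude = rng.pareto(alpha)
    sign = rng.choice([-1.0, 1.0])
    return sign * magnitude


def truncated_sgd(grad, x0, lr, b, alpha, n_steps, seed=0):
    """Run 1-D SGD with injected heavy-tailed noise and a clipped step.

    Update rule assumed here: x <- x + clip(-lr * grad(x) + lr * z, b).
    Returns the full trajectory, including the starting point.
    """
    rng = np.random.default_rng(seed)
    x = float(x0)
    path = np.empty(n_steps + 1)
    path[0] = x
    for k in range(n_steps):
        z = heavy_tailed_noise(rng, alpha)
        step = -lr * grad(x) + lr * z
        x = x + clip(step, b)  # no single move can exceed b in size
        path[k + 1] = x
    return path
```

For example, truncated_sgd(lambda x: x, x0=0.0, lr=0.1, b=0.5, alpha=1.2, n_steps=1000) runs the iteration on the quadratic loss x^2 / 2. Because every move is capped at b, leaving a wide basin takes several large noise draws in a row, while a narrow basin can be left with fewer, which is the intuition behind preferring flat minima.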
Abstract
Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. However, theoretical understanding lags far behind their empirical success. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape, which are associated with poor generalization. To unravel this mystery and further enhance such capability of SGDs, it is imperative to go beyond the traditional local convergence analysis and obtain a comprehensive understanding of SGDs' global dynamics. In this paper, we develop a set of technical machinery based on the recent large deviations and metastability analysis in Wang and Rhee (2023) and obtain sharp characterization of the global dynamics of heavy-tailed SGDs. In particular, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noises during the training phase, SGD…
