Global Dynamics of Heavy-Tailed SGDs in Nonconvex Loss Landscape: Characterization and Control

Xingyu Wang; Chang-Han Rhee

arXiv:2510.20905·cs.LG·October 27, 2025

TL;DR

This paper analyzes, both theoretically and empirically, how the dynamics of heavy-tailed stochastic gradient descent (SGD) help training avoid sharp minima and improve generalization in deep learning. The analysis builds on large deviations and metastability theory.

Contribution

The paper develops a framework for the global dynamics of heavy-tailed SGD and uses it to show how injecting and then truncating heavy-tailed noise helps SGD avoid sharp minima and generalize better.

Findings

01. Heavy-tailed SGD avoids sharp minima more effectively.

02. Gradient clipping leads to flatter minima and better test performance.

03. Theoretical predictions are confirmed by simulations and deep learning experiments.

Abstract

Stochastic gradient descent (SGD) and its variants enable modern artificial intelligence. However, theoretical understanding lags far behind their empirical success. It is widely believed that SGD has a curious ability to avoid sharp local minima in the loss landscape, which are associated with poor generalization. To unravel this mystery and further enhance such capability of SGDs, it is imperative to go beyond the traditional local convergence analysis and obtain a comprehensive understanding of SGDs' global dynamics. In this paper, we develop a set of technical machinery based on the recent large deviations and metastability analysis in Wang and Rhee (2023) and obtain sharp characterization of the global dynamics of heavy-tailed SGDs. In particular, we reveal a fascinating phenomenon in deep learning: by injecting and then truncating heavy-tailed noises during the training phase, SGD…
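The abstract describes injecting and then truncating heavy-tailed noise during training. A minimal sketch of what such an update could look like on a one-dimensional toy problem follows. Everything in it is an illustrative assumption rather than the authors' algorithm: the double-well loss, the symmetric Pareto noise, the choice to clip the whole step to magnitude `b`, and the names `clip`, `heavy_tailed_noise`, `grad`, and `truncated_heavy_tailed_sgd`.

```python
import random


def clip(step, b):
    """Truncate a scalar step so its magnitude is at most b."""
    return max(-b, min(b, step))


def heavy_tailed_noise(rng, alpha):
    """Symmetric noise with Pareto(alpha) tails (assumed noise model)."""
    # paretovariate returns values >= 1; shift so the magnitude starts at 0.
    magnitude = rng.paretovariate(alpha) - 1.0
    return magnitude if rng.random() < 0.5 else -magnitude


def grad(x):
    """Gradient of the toy double-well loss f(x) = (x**2 - 1)**2."""
    return 4.0 * x * (x * x - 1.0)


def truncated_heavy_tailed_sgd(x0, eta, b, alpha, n_steps, seed=0):
    """Run SGD with injected heavy-tailed noise and a truncated step.

    eta is the learning rate, b the truncation threshold, and alpha the
    tail index of the injected noise (smaller alpha means heavier tails).
    """
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        noisy_grad = grad(x) + heavy_tailed_noise(rng, alpha)
        # Truncation: no single update moves the iterate by more than b.
        x = x - clip(eta * noisy_grad, b)
        path.append(x)
    return path


if __name__ == "__main__":
    path = truncated_heavy_tailed_sgd(
        x0=1.0, eta=0.01, b=0.5, alpha=1.5, n_steps=1000, seed=1
    )
    print(len(path), max(abs(q - p) for p, q in zip(path, path[1:])))
```

In this sketch the truncation threshold `b` caps how far one large noise sample can move the iterate, so leaving a wide basin would take several large jumps while a narrow basin can be left in one. This is only an intuition under the assumptions above; the precise statements are in the paper.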
