# Adaptive Heavy-Tailed Stochastic Gradient Descent

**Authors:** Bodu Gong, Gustavo Enrique Batista, Pierre Lafaye de Micheaux

arXiv: 2508.21353 · 2025-09-01

## TL;DR

This paper introduces AHTSGD, an adaptive optimizer that injects heavy-tailed noise early in training to promote exploration of wide minima, then shifts to lighter-tailed noise as sharpness stabilizes. The authors report better generalization and faster convergence in neural network training.

## Contribution

The paper presents AHTSGD as the first algorithm to adjust the nature of the noise injected into an optimizer based on the Edge of Stability phenomenon, with the aim of improving training efficiency and generalization in neural networks.

## Key findings

- Consistently outperforms SGD and other noise-based methods on benchmarks such as MNIST and CIFAR-10.
- Shows marked gains on noisy datasets such as SVHN.
- Accelerates early training from poor initializations and remains robust to learning rate choices.

## Abstract

In the era of large-scale neural network models, optimization algorithms often struggle with generalization due to an overreliance on training loss. One key insight widely accepted in the machine learning community is the idea that wide basins (regions around a local minimum where the loss increases gradually) promote better generalization by offering greater stability to small changes in input data or model parameters. In contrast, sharp minima are typically more sensitive and less stable. Motivated by two key empirical observations (the inherent heavy-tailed distribution of gradient noise in stochastic gradient descent, and the Edge of Stability phenomenon during neural network training, in which curvature grows before settling at a plateau), we introduce Adaptive Heavy-Tailed Stochastic Gradient Descent (AHTSGD). The algorithm injects heavier-tailed noise into the optimizer during the early stages of training to enhance exploration and gradually transitions to lighter-tailed noise as sharpness stabilizes. By dynamically adapting to the sharpness of the loss landscape throughout training, AHTSGD promotes accelerated convergence to wide basins. AHTSGD is the first algorithm to adjust the nature of the noise injected into an optimizer based on the Edge of Stability phenomenon. AHTSGD consistently outperforms SGD and other noise-based methods on benchmarks like MNIST and CIFAR-10, with marked gains on noisy datasets such as SVHN. It ultimately accelerates early training from poor initializations and improves generalization across clean and noisy settings, remaining robust to learning rate choices.
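The noise schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the noise family, the schedule, or how sharpness is measured. The sketch assumes symmetric alpha-stable noise (sampled with the Chambers-Mallows-Stuck method), where a tail index `alpha` below 2 gives heavy tails and `alpha = 2` gives Gaussian noise. The functions `tail_index` and `noisy_sgd_step`, the linear schedule, and the parameters `plateau`, `alpha_min`, and `noise_scale` are all hypothetical names introduced for illustration.

```python
import numpy as np

def symmetric_alpha_stable(alpha, size, rng):
    """Chambers-Mallows-Stuck sampler for symmetric alpha-stable noise.

    Valid for 0 < alpha <= 2. alpha = 2 yields Gaussian noise with
    variance 2; smaller alpha yields heavier tails.
    """
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)
    w = rng.exponential(1.0, size)
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * u) / w) ** ((1.0 - alpha) / alpha))

def tail_index(sharpness, plateau, alpha_min=1.2, alpha_max=2.0):
    """Hypothetical schedule (an assumption, not from the paper):
    alpha rises linearly from alpha_min to alpha_max as the measured
    sharpness approaches its plateau value."""
    progress = np.clip(sharpness / plateau, 0.0, 1.0)
    alpha = alpha_min + (alpha_max - alpha_min) * progress
    return float(np.clip(alpha, alpha_min, alpha_max))

def noisy_sgd_step(params, grad, lr, sharpness, plateau, noise_scale, rng):
    """One SGD step with injected noise whose tail index tracks sharpness."""
    alpha = tail_index(sharpness, plateau)
    noise = symmetric_alpha_stable(alpha, params.shape, rng)
    return params - lr * grad + lr * noise_scale * noise

# Usage: low sharpness early in training gives heavy-tailed noise
# (alpha near alpha_min); near the plateau the noise is close to Gaussian.
rng = np.random.default_rng(0)
params = np.zeros(4)
grad = np.ones(4)
params = noisy_sgd_step(params, grad, lr=0.1, sharpness=2.0,
                        plateau=20.0, noise_scale=0.01, rng=rng)
```

In this sketch the caller supplies `sharpness` (for example, an estimate of the largest Hessian eigenvalue) and `plateau`. Both how to estimate sharpness and where the plateau lies are left open here, since the summary does not say how the paper handles them.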

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21353/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21353/full.md

## References

26 references — full list in the complete paper: https://tomesphere.com/paper/2508.21353/full.md

---
Source: https://tomesphere.com/paper/2508.21353