Sharp High-Probability Rates for Nonlinear SGD under Heavy-Tailed Noise via Symmetrization
Aleksandar Armacki, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

TL;DR
This paper studies a black-box nonlinear SGD framework (sign, clipping, normalization and their smooth counterparts) and establishes high-probability convergence rates in non-convex optimization under heavy-tailed noise, using noise symmetrization to extend the guarantees to non-symmetric noise.
Contribution
It analyzes a unified nonlinear SGD framework and proposes two novel symmetrization-based gradient estimators, yielding sharp high-probability rates under relaxed heavy-tailed noise conditions, including non-symmetric noise.
Findings
N-SGD attains a rate of order t^{-1/2} with exponentially decaying tails, for noise with a symmetric PDF.
Symmetrized estimators handle non-symmetric heavy-tailed noise effectively.
The framework improves on convergence guarantees obtained under prior bounded-moment assumptions.
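As a rough illustration of the general idea behind noise symmetrization (not the paper's estimators, whose exact form is not given here): the difference of two independent draws from the same distribution is symmetric about zero, even when the distribution itself is skewed. The sketch below assumes a Lomax (Pareto II) noise model with shape 2.5, chosen only for illustration, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
shape = 2.5  # illustrative tail index; the mean is finite but the tail is heavy

# Two independent samples of skewed, heavy-tailed noise (Lomax / Pareto II).
z1 = rng.pareto(shape, size=n)
z2 = rng.pareto(shape, size=n)

# Analytic mean of a Lomax distribution with unit scale: 1 / (shape - 1).
true_mean = 1.0 / (shape - 1.0)

# Skewness check: far fewer than half of the samples lie above the mean.
frac_above_mean_single = np.mean(z1 > true_mean)

# Symmetrized noise: the difference of two i.i.d. draws is symmetric about 0,
# so roughly half of the differences are positive.
d = z1 - z2
frac_positive_diff = np.mean(d > 0)

print(frac_above_mean_single, frac_positive_diff)
```

The first fraction should sit well below one half, reflecting the skew of a single draw, while the second should be close to one half.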
Abstract
We study high-probability convergence of SGD-type methods in non-convex optimization in the presence of heavy-tailed noise. To combat the heavy-tailed noise, a general black-box nonlinear framework is considered, subsuming nonlinearities like sign, clipping, normalization and their smooth counterparts. Our first result shows that nonlinear SGD (N-SGD) achieves a rate of order t^{-1/2}, for any noise with unbounded moments and a symmetric probability density function (PDF). Crucially, N-SGD has exponentially decaying tails, matching the performance of linear SGD under light-tailed noise. To handle non-symmetric noise, we propose two novel estimators, based on the idea of noise symmetrization. The first, dubbed Symmetrized Gradient Estimator (SGE), assumes a noiseless gradient at any reference point is available at the start of training, while the second, dubbed…
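A minimal sketch of what a nonlinear SGD loop of this kind might look like, assuming the update x_{t+1} = x_t - a_t * Psi(g_t), where Psi is one of the nonlinearities named in the abstract and g_t is a noisy gradient. The step-size schedule a / sqrt(t), the clipping threshold, the joint (norm-based) form of clipping, the toy quadratic objective, and the Cauchy noise model are all assumptions made for illustration; none of them is taken from the paper.

```python
import numpy as np

def sign_map(g):
    # Component-wise sign nonlinearity.
    return np.sign(g)

def clip_map(g, tau=1.0):
    # Joint clipping: rescale g so that its Euclidean norm is at most tau.
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalize_map(g, eps=1e-12):
    # Normalization: keep the direction, discard the magnitude.
    return g / (np.linalg.norm(g) + eps)

def nonlinear_sgd(grad, x0, nonlinearity, steps=2000, a=1.0, seed=0):
    # Hypothetical N-SGD loop: apply the nonlinearity to the noisy gradient.
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for t in range(1, steps + 1):
        # Standard Cauchy noise: symmetric PDF, no finite mean.
        noise = rng.standard_cauchy(size=x.shape)
        g = grad(x) + noise
        x = x - (a / np.sqrt(t)) * nonlinearity(g)
    return x

# Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = nonlinear_sgd(lambda x: x, x0=[5.0, -3.0], nonlinearity=clip_map)
print(x_final)
```

Each nonlinearity bounds the size of the update, so a single extreme noise sample cannot move the iterate arbitrarily far, which is the intuition for robustness to heavy tails.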
