Nonlinear Stochastic Gradient Descent and Heavy-tailed Noise: A Unified Framework and High-probability Guarantees
Aleksandar Armacki, Shuhua Yu, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, Soummya Kar

TL;DR
This paper develops a unified theoretical framework for nonlinear stochastic gradient descent methods under heavy-tailed noise, providing high-probability convergence guarantees for various nonlinearities without requiring noise moment assumptions.
Contribution
It introduces a black-box analysis of nonlinear SGD, establishing convergence guarantees for a broad class of nonlinearities under heavy-tailed noise, improving upon existing bounds.
Findings
Unified guarantees for nonlinear SGD methods.
Convergence rates depend on noise and problem parameters.
Clipping is not always the optimal nonlinearity.
Abstract
We study high-probability convergence in online learning, in the presence of heavy-tailed noise. To combat the heavy tails, a general framework of nonlinear SGD methods is considered, subsuming several popular nonlinearities like sign, quantization, component-wise and joint clipping. In our work the nonlinearity is treated in a black-box manner, allowing us to establish unified guarantees for a broad range of nonlinear methods. For symmetric noise and non-convex costs we establish convergence of gradient norm-squared, at a rate , while for the last iterate of strongly convex costs we establish convergence to the population optima, at a rate , where depends on noise and problem parameters. Further, if the noise is a (biased) mixture of symmetric and non-symmetric components, we show convergence to a…
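As a rough illustration of the framework the abstract describes, the sketch below applies a black-box nonlinearity to each noisy gradient before the SGD step. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the step-size schedule a / (t + 1)^delta, the clipping threshold m, and the choice of Student-t noise as a symmetric heavy-tailed example are all hypothetical choices made for illustration.

```python
import numpy as np


def sign_nl(g):
    """Component-wise sign nonlinearity."""
    return np.sign(g)


def componentwise_clip(g, m=1.0):
    """Clip every coordinate to [-m, m] (m is an assumed threshold)."""
    return np.clip(g, -m, m)


def joint_clip(g, m=1.0):
    """Rescale the whole vector so its Euclidean norm is at most m."""
    norm = np.linalg.norm(g)
    return g if norm <= m else g * (m / norm)


def nonlinear_sgd(grad, x0, nonlinearity, steps=5000, a=1.0, delta=0.75,
                  df=1.5, seed=0):
    """Run x <- x - step_t * nonlinearity(grad(x) + noise).

    Assumptions for illustration only:
    - noise is Student-t with df degrees of freedom (symmetric; infinite
      variance when df <= 2),
    - the step size is a / (t + 1) ** delta.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for t in range(1, steps + 1):
        noise = rng.standard_t(df, size=x.shape)  # heavy-tailed sample
        step = a / (t + 1) ** delta
        # The nonlinearity is treated as a black box: any of the maps
        # above can be passed in without changing the loop.
        x = x - step * nonlinearity(grad(x) + noise)
    return x


if __name__ == "__main__":
    # Toy strongly convex cost f(x) = 0.5 * ||x||^2, whose gradient is x.
    x0 = np.array([5.0, -5.0])
    for name, nl in [("sign", sign_nl),
                     ("component-wise clip", componentwise_clip),
                     ("joint clip", joint_clip)]:
        x_final = nonlinear_sgd(lambda x: x, x0, nl)
        print(name, np.linalg.norm(x_final))
```

Because the update loop only ever calls `nonlinearity(...)`, swapping sign for clipping (or any other bounded map) changes one argument, which mirrors the black-box treatment the abstract refers to.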
Taxonomy
Topics: Stochastic processes and financial applications · Gas Dynamics and Kinetic Theory · Target Tracking and Data Fusion in Sensor Networks
Methods: Stochastic Gradient Descent
