Can SGD Handle Heavy-Tailed Noise?
Ilyas Fatkhullin, Florian Hübler, Guanghui Lan

TL;DR
This paper provides a rigorous theoretical analysis demonstrating that vanilla SGD can effectively handle heavy-tailed noise in various optimization settings, establishing optimal convergence rates under minimal assumptions.
Contribution
It proves sharp convergence guarantees for vanilla SGD under heavy-tailed noise across convex, strongly convex, and non-convex problems, with minimax optimal sample complexities.
Findings
SGD achieves minimax optimal sample complexity in convex regimes.
In non-convex settings, SGD converges to stationary points, with matching lower bounds.
In the non-convex setting, mini-batch SGD attains a similar sample complexity, possibly with improved constants.
Abstract
Stochastic Gradient Descent (SGD) is a cornerstone of large-scale optimization, yet its theoretical behavior under heavy-tailed noise -- common in modern machine learning and reinforcement learning -- remains poorly understood. In this work, we rigorously investigate whether vanilla SGD, devoid of any adaptive modifications, can provably succeed under such adverse stochastic conditions. Assuming only that stochastic gradients have bounded $p$-th moments for some $p \in (1, 2]$, we establish sharp convergence guarantees for (projected) SGD across convex, strongly convex, and non-convex problem classes. In particular, we show that SGD achieves minimax optimal sample complexity under minimal assumptions in the convex and strongly convex regimes: $\mathcal{O}(\varepsilon^{-\frac{p}{p-1}})$ and $\mathcal{O}(\varepsilon^{-\frac{p}{2(p-1)}})$, respectively. For non-convex objectives under…
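To make the setting in the abstract concrete, here is a minimal sketch of projected SGD with heavy-tailed gradient noise. Everything specific in it is an assumption chosen for illustration and is not taken from the paper: the quadratic objective f(x) = 0.5 * ||x - x_star||^2, the Euclidean-ball constraint set, the Student-t noise with 1.5 degrees of freedom (zero mean, infinite variance, finite moments only of order below 1.5), and the 1/t step size. The names `project_ball` and `projected_sgd` are hypothetical helpers.

```python
import numpy as np


def project_ball(x, radius):
    """Euclidean projection onto the ball of the given radius centered at 0."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)


def projected_sgd(x0, x_star, steps, radius, df, seed=0):
    """Projected SGD on f(x) = 0.5 * ||x - x_star||^2 with Student-t noise.

    Illustrative assumptions (not the paper's setup): the objective, the
    ball constraint, the noise law, and the 1/t step-size schedule.
    Returns the last iterate and the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for t in range(1, steps + 1):
        # Student-t noise with df in (1, 2]: the mean is zero, the variance
        # is infinite, and only moments of order strictly below df are finite.
        noise = rng.standard_t(df, size=x.shape)
        grad = (x - x_star) + noise      # unbiased stochastic gradient
        eta = 1.0 / t                    # assumed step size for this toy problem
        x = project_ball(x - eta * grad, radius)
        avg += (x - avg) / t             # running average of the iterates
    return x, avg


x_last, x_avg = projected_sgd(
    x0=np.zeros(3),
    x_star=np.array([0.5, -0.5, 0.25]),
    steps=2000,
    radius=2.0,
    df=1.5,
)
print("distance of last iterate to optimum:", np.linalg.norm(x_last - np.array([0.5, -0.5, 0.25])))
```

No gradient clipping or normalization is used here; the projection onto a bounded set is the only mechanism that keeps a single large noise sample from pushing the iterate arbitrarily far. Changing `df` toward 2 makes the noise lighter-tailed, which is one simple way to see how the tail index affects the error on this toy problem.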
