From Gradient Clipping to Normalization for Heavy Tailed SGD
Florian Hübler, Ilyas Fatkhullin, Niao He

TL;DR
This paper analyzes the convergence of Normalized SGD (NSGD) in heavy-tailed gradient noise settings, providing new theoretical guarantees and optimal sample complexity bounds that improve upon existing gradient clipping methods.
Contribution
It introduces a parameter-free convergence analysis of NSGD with tight sample complexity bounds, addressing limitations of gradient clipping in heavy-tailed noise scenarios.
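To make the contrast concrete, here is a minimal sketch of the two update rules in question. It is not the paper's exact algorithm: the function names, the toy quadratic objective, the Student-t noise model, and the 0.1/sqrt(t) step-size schedule are all assumptions chosen for illustration. The point is only that a normalized step needs no threshold, whereas a clipped step depends on one.

```python
import numpy as np

def nsgd_step(x, grad, lr, eps=1e-12):
    """Normalized step: move (almost exactly) distance lr along -grad."""
    norm = np.linalg.norm(grad)
    # eps is a hypothetical safeguard against division by zero.
    return x - lr * grad / (norm + eps)

def clipped_sgd_step(x, grad, lr, threshold):
    """Clipped step: rescale grad only when its norm exceeds threshold."""
    norm = np.linalg.norm(grad)
    scale = min(1.0, threshold / norm) if norm > 0 else 1.0
    return x - lr * scale * grad

def noisy_grad(x, rng, df=1.5):
    """Gradient of f(x) = 0.5 * ||x||^2 plus Student-t noise.

    With df = 1.5 the noise has infinite variance, a toy stand-in
    for heavy-tailed gradient noise (an assumption, not the paper's setup).
    """
    return x + rng.standard_t(df, size=x.shape)

rng = np.random.default_rng(0)
x = np.ones(5)
for t in range(1, 1001):
    # Illustrative decaying step size; not the schedule analyzed in the paper.
    x = nsgd_step(x, noisy_grad(x, rng), lr=0.1 / np.sqrt(t))
```

Because every normalized step has length at most lr, a single extreme noise sample cannot move the iterate far, which is the intuition behind using normalization where a clipping threshold would otherwise have to be tuned.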
Findings
NSGD achieves a sample complexity of O(ε^{-2p/(p-1)}) for finding ε-stationary points, where p ∈ (1, 2] is the order of the bounded noise moment.
Matching lower bounds demonstrate the optimality of the proposed complexity.
High-probability convergence with mild dependence on failure probability is established.
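As a quick sanity check on the exponent in the first finding, the arithmetic below evaluates 2p/(p-1) at two sample values, assuming 1 < p ≤ 2 (the usual heavy-tailed regime; the specific values of p are chosen here for illustration only).

```latex
\[
p = 2:\quad \frac{2p}{p-1} = \frac{4}{1} = 4
\;\Rightarrow\; O(\varepsilon^{-4}),
\qquad
p = \tfrac{3}{2}:\quad \frac{2p}{p-1} = \frac{3}{1/2} = 6
\;\Rightarrow\; O(\varepsilon^{-6}).
\]
\[
\frac{2p}{p-1} \to \infty \quad \text{as } p \to 1^{+}.
\]
```

So the bound reduces to the familiar ε^{-4} rate when the noise has bounded variance (p = 2) and degrades as the tails get heavier.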
Abstract
Recent empirical evidence indicates that many machine learning applications involve heavy-tailed gradient noise, which challenges the standard assumptions of bounded variance in stochastic optimization. Gradient clipping has emerged as a popular tool to handle this heavy-tailed noise, as it achieves good performance in this setting both theoretically and practically. However, our current theoretical understanding of non-convex gradient clipping has three main shortcomings. First, the theory hinges on large, increasing clipping thresholds, which are in stark contrast to the small constant clipping thresholds employed in practice. Second, clipping thresholds require knowledge of problem-dependent parameters to guarantee convergence. Lastly, even with this knowledge, current sampling complexity upper bounds for the method are sub-optimal in nearly all parameters. To address these issues,…
Taxonomy
Topics: Gas Dynamics and Kinetic Theory · Atomic and Molecular Physics · Advanced Numerical Methods in Computational Mathematics
Methods: Stochastic Gradient Descent · Gradient Clipping
