Revisiting Gradient Normalization and Clipping for Nonconvex SGD under Heavy-Tailed Noise: Necessity, Sufficiency, and Acceleration
Tao Sun, Xinwang Liu, Kun Yuan

TL;DR
This paper challenges the traditional view that gradient clipping is necessary for nonconvex SGD under heavy-tailed noise, showing that gradient normalization alone suffices for convergence and that combining it with clipping improves the convergence rates.
Contribution
It provides a unified theoretical framework covering normalization-only, clipping-only, and combined approaches, and shows that gradient normalization alone is sufficient for convergence under heavy-tailed noise in nonconvex optimization.
Findings
Normalization alone guarantees convergence under individual smoothness.
Combining normalization with clipping yields faster convergence in challenging noise conditions.
An accelerated method under second-order smoothness further improves convergence rates.
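The three update styles compared in the findings above can be illustrated with a minimal sketch. This is not the paper's algorithm: the function names, the momentum buffer, the threshold `tau`, the toy objective, and the Student-t noise model are all assumptions chosen for illustration, using the textbook forms of norm clipping and normalization. A momentum buffer is included because normalizing a single clipped gradient gives the same direction as normalizing the unclipped one, so the combination only differs once gradients are averaged.

```python
import numpy as np

def clip(g, tau):
    """Rescale g so its Euclidean norm is at most tau (standard norm clipping)."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def normalize(g, eps=1e-12):
    """Scale g to (approximately) unit Euclidean norm; eps avoids division by zero."""
    return g / (np.linalg.norm(g) + eps)

def step(x, m, g, lr, beta=0.9, mode="normalize", tau=1.0):
    """One update of x given stochastic gradient g and momentum buffer m.

    mode: "clip" (clipping only), "normalize" (normalization only),
          or "both" (clip each gradient, then normalize the momentum).
    These three variants are illustrative assumptions, not the paper's methods.
    """
    if mode in ("clip", "both"):
        g = clip(g, tau)
    m = beta * m + (1.0 - beta) * g
    direction = normalize(m) if mode in ("normalize", "both") else m
    return x - lr * direction, m

def run(mode, steps=500, lr=0.01, seed=0):
    """Toy problem (assumed): f(x) = 0.5 * ||x||^2 with heavy-tailed gradient noise."""
    rng = np.random.default_rng(seed)
    x = np.ones(10)
    m = np.zeros_like(x)
    for _ in range(steps):
        # Student-t noise with df=1.5 has a finite mean but infinite variance.
        noise = rng.standard_t(df=1.5, size=x.shape)
        x, m = step(x, m, x + noise, lr, mode=mode)
    return float(np.linalg.norm(x))

if __name__ == "__main__":
    for mode in ("clip", "normalize", "both"):
        print(mode, run(mode))
```

In the normalization-based modes every step has length at most `lr`, regardless of how large a single noisy gradient is, which gives some intuition for why normalization can tame heavy-tailed noise without an explicit clipping threshold.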
Abstract
Gradient clipping has long been considered essential for ensuring the convergence of Stochastic Gradient Descent (SGD) in the presence of heavy-tailed gradient noise. In this paper, we revisit this belief and explore whether gradient normalization can serve as an effective alternative or complement. We prove that, under individual smoothness assumptions, gradient normalization alone is sufficient to guarantee convergence of nonconvex SGD. Moreover, when combined with clipping, it yields far better rates of convergence under more challenging noise distributions. We provide a unifying theory describing normalization-only, clipping-only, and combined approaches. Moving forward, we investigate existing variance-reduced algorithms, establishing that, in such a setting, normalization alone is sufficient for convergence. Finally, we present an accelerated variant that under second-order…
Taxonomy
Topics: Sparse and Compressive Sensing Techniques
Methods: Gradient Clipping · Gradient Normalization · Stochastic Gradient Descent
