Avoiding Bias in Clipped SGD for Overparameterized Models under Generalized Smoothness
Aleksandr Lobanov, Anastasia Koloskova

TL;DR
This paper provides a theoretical analysis showing that clipped and normalized SGD can effectively converge in overparameterized models without bias under generalized smoothness conditions, explaining their empirical success.
Contribution
It introduces a novel analysis under $(L_0,L_1)$-smoothness, demonstrating convergence of clipped and normalized SGD without bias in overparameterized models.
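For reference, a commonly used statement of $(L_0,L_1)$-smoothness for a twice-differentiable objective $f$ is sketched below. This is the standard form from the generalized-smoothness literature and is given here as an assumption; the paper may work with an equivalent first-order variant or a slightly different formulation.

$$
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x .
$$

Setting $L_1 = 0$ recovers the usual $L$-smoothness condition with $L = L_0$, so the generalized condition permits the local curvature to grow with the gradient norm.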
Findings
Clipped and normalized SGD do not suffer from clipping-induced bias and converge effectively in overparameterized models.
The analysis under generalized smoothness improves upon prior convergence results.
The results extend to heavy-tailed noise and deterministic regimes.
Abstract
Modern machine learning is dominated by complex, overparameterized architectures capable of interpolating data and achieving zero training loss. For such models, we investigate the convergence properties of two popular modifications to standard SGD: clipped SGD and normalized SGD. We show that under overparameterization and a mild assumption on batch size, both clipped and normalized SGD do not suffer from the bias typically introduced by clipping, converging effectively at the same rate as their deterministic counterparts. This provides a rigorous theoretical justification for the empirical success of gradient clipping methods. In our analysis, we employ the $(L_0,L_1)$-smoothness condition, under which we obtain convergence rates that improve upon the best known results in prior work. Furthermore, we extend our analysis to specific challenging regimes, including heavy-tailed noise,…
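The two update rules named in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy least-squares problem with more parameters than samples (so an interpolating solution with zero loss exists), the batch size of one, the step size `lr`, the clipping threshold `c`, and all function names.

```python
import numpy as np

def clip(g, c):
    """Rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def clipped_sgd_step(x, g, lr, c):
    """One clipped SGD step: x - lr * clip(g, c)."""
    return x - lr * clip(g, c)

def normalized_sgd_step(x, g, lr, eps=1e-12):
    """One normalized SGD step: x - lr * g / ||g|| (eps avoids division by zero)."""
    return x - lr * g / (np.linalg.norm(g) + eps)

def make_interpolating_problem(n=5, d=20, seed=0):
    """Toy overparameterized least squares (d > n), so zero loss is attainable."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)  # unit-norm rows
    x_star = 10.0 * rng.standard_normal(d)         # hypothetical ground truth
    b = A @ x_star                                 # labels are fit exactly by x_star
    return A, b

def loss(A, b, x):
    """Mean of the per-sample losses 0.5 * (a_i^T x - b_i)^2."""
    return 0.5 * np.mean((A @ x - b) ** 2)

def run_clipped_sgd(A, b, lr=0.5, c=1.0, steps=5000, seed=1):
    """Clipped SGD with batch size 1 (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        i = rng.integers(n)              # sample one data point uniformly
        g = A[i] * (A[i] @ x - b[i])     # stochastic gradient of the i-th loss
        x = clipped_sgd_step(x, g, lr, c)
    return x

if __name__ == "__main__":
    A, b = make_interpolating_problem()
    x0 = np.zeros(A.shape[1])
    x_final = run_clipped_sgd(A, b)
    print("initial loss:", loss(A, b, x0))
    print("final loss:  ", loss(A, b, x_final))
```

Because every per-sample gradient vanishes at an interpolating solution, the clipping operator becomes inactive near the optimum in this toy setting, which gives some intuition for why clipping need not introduce bias under overparameterization. The sketch does not reproduce the paper's step-size choices or rates.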
