Avoiding Bias in Clipped SGD for Overparameterized Models under Generalized Smoothness

Aleksandr Lobanov; Anastasia Koloskova

arXiv:2605.14800·math.OC·May 15, 2026

TL;DR

This paper shows theoretically that, for overparameterized models under generalized smoothness and a mild batch-size assumption, clipped and normalized SGD converge without the bias that clipping typically introduces, which helps explain their empirical success.

Contribution

The paper analyzes clipped and normalized SGD under $(L_0, L_1)$-smoothness and shows that, in overparameterized models, both methods converge without the bias typically introduced by clipping.
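
For orientation, the objects named above can be written out as follows. This is a sketch in commonly used notation, not necessarily the paper's exact formulation: the Hessian-based form of $(L_0, L_1)$-smoothness shown here is the one usually stated for twice-differentiable $f$, and the symbols $\eta$ (step size), $c$ (clipping threshold), and $g_t$ (stochastic gradient at $x_t$) are assumptions introduced for illustration.

```latex
% (L_0, L_1)-smoothness, in its common Hessian-based form (assumed notation):
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x.

% Clipped SGD with threshold c > 0 and step size \eta > 0:
x_{t+1} = x_t - \eta \, \min\!\left(1, \frac{c}{\|g_t\|}\right) g_t .

% Normalized SGD (only the direction of g_t is used):
x_{t+1} = x_t - \eta \, \frac{g_t}{\|g_t\|} .
```

Setting $L_1 = 0$ recovers the standard $L_0$-smoothness condition, which is why the condition is described as generalized smoothness.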

Findings

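01

In overparameterized models, under a mild assumption on batch size, clipped and normalized SGD do not suffer from clipping bias and converge at the same rate as their deterministic counterparts.

02

Under $(L_0, L_1)$-smoothness, the analysis yields convergence rates that improve upon the best known results in prior work.

03

The analysis extends to challenging regimes, including heavy-tailed noise and the deterministic setting.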

Abstract

Modern machine learning is dominated by complex, overparameterized architectures capable of interpolating data and achieving zero training loss. For such models, we investigate the convergence properties of two popular modifications to standard SGD: clipped SGD and normalized SGD. We show that under overparameterization and a mild assumption on batch size, both clipped and normalized SGD do not suffer from the bias typically introduced by clipping, converging effectively at the same rate as their deterministic counterparts. This provides a rigorous theoretical justification for the empirical success of gradient clipping methods. In our analysis, we employ the $(L_0, L_1)$-smoothness condition, under which we obtain convergence rates that improve upon the best known results in prior work. Furthermore, we extend our analysis to specific challenging regimes, including heavy-tailed noise,…
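
The interpolation setting described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's experiment or algorithm specification: it is a toy overparameterized least-squares problem (more parameters than samples, noiseless labels) chosen so that an interpolating solution exists. The function names `make_problem`, `clip`, and `run_sgd`, and all parameter values (step size, threshold, batch size), are hypothetical choices made for illustration.

```python
import numpy as np


def make_problem(n=20, d=100, seed=0):
    """Overparameterized least squares (d > n) whose labels can be fit exactly."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(n, d)) / np.sqrt(d)  # rows have norm close to 1
    x_star = rng.normal(size=d)
    b = A @ x_star  # noiseless labels, so an interpolating solution exists
    return A, b


def clip(g, threshold):
    """Rescale g so that its Euclidean norm is at most `threshold`."""
    norm = np.linalg.norm(g)
    if norm <= threshold:
        return g
    return g * (threshold / norm)


def run_sgd(A, b, lr, steps, batch_size, clip_threshold=None, seed=1):
    """Mini-batch SGD on 0.5 * mean((A x - b)^2); clips if a threshold is given."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = A[idx] @ x - b[idx]
        g = A[idx].T @ residual / batch_size  # stochastic gradient
        if clip_threshold is not None:
            g = clip(g, clip_threshold)
        x = x - lr * g
    return 0.5 * np.mean((A @ x - b) ** 2)  # full training loss


if __name__ == "__main__":
    A, b = make_problem()
    print("plain SGD  :", run_sgd(A, b, lr=2.0, steps=2000, batch_size=5))
    print("clipped SGD:", run_sgd(A, b, lr=2.0, steps=2000, batch_size=5,
                                  clip_threshold=0.1))
```

In this toy problem every per-sample gradient vanishes at an interpolating solution, so the stochastic gradients shrink as the iterates approach it and the clipping threshold eventually stops being active. That is the intuition behind the absence of a bias floor here; the sketch makes no claim about the rates or assumptions established in the paper.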
