Convergence of Clipped-SGD for Convex $(L_0,L_1)$-Smooth Optimization with Heavy-Tailed Noise

Savelii Chezhegov; Aleksandr Beznosikov; Samuel Horváth; Eduard Gorbunov

arXiv:2505.20817 · math.OC · September 30, 2025

TL;DR

This paper establishes the first high-probability convergence bounds for Clipped-SGD in convex optimization with heavy-tailed noise under $(L_0,L_1)$-smoothness, broadening theoretical understanding and practical applicability.

Contribution

It provides the first high-probability convergence analysis for Clipped-SGD under heavy-tailed noise and $(L_0,L_1)$-smoothness, extending prior results and removing restrictive assumptions.
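
For orientation, the objects named above are commonly formalized as shown below. This is a sketch using standard formulations from the clipping literature, not a transcription of the paper: its exact assumptions (for example, its precise variant of $(L_0,L_1)$-smoothness, which need not require a Hessian) may differ. The stepsize $\gamma$, the sample $\xi$, and the noise level $\sigma$ are notation assumed here for illustration; $\lambda$ is the clipping threshold and $\alpha \in (1,2]$ the tail exponent, with $\operatorname{clip}(0,\lambda) = 0$ by convention.

$$
\begin{aligned}
\operatorname{clip}(g,\lambda) &= \min\left\{1, \frac{\lambda}{\|g\|}\right\} g, \\
x^{k+1} &= x^k - \gamma \operatorname{clip}\left(\nabla f_{\xi^k}(x^k), \lambda\right), \\
\|\nabla^2 f(x)\| &\le L_0 + L_1 \|\nabla f(x)\|, \\
\mathbb{E}_{\xi}\left[\nabla f_{\xi}(x)\right] &= \nabla f(x), \qquad
\mathbb{E}_{\xi}\left\|\nabla f_{\xi}(x) - \nabla f(x)\right\|^{\alpha} \le \sigma^{\alpha}.
\end{aligned}
$$

Under these formulations, setting $L_1 = 0$ gives ordinary $L_0$-smoothness, and $\alpha = 2$ gives the usual bounded-variance assumption; for $\alpha < 2$ the variance of the stochastic gradient may be infinite.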

Findings

01

First high-probability convergence bounds for Clipped-SGD with heavy-tailed noise.

02

Bounds recover known results in deterministic and $L_1=0$ cases.

03

Rates avoid exponential factors and restrictive noise assumptions.

Abstract

Gradient clipping is a widely used technique in Machine Learning and Deep Learning (DL), known for its effectiveness in mitigating the impact of heavy-tailed noise, which frequently arises in the training of large language models. Additionally, first-order methods with clipping, such as Clip-SGD, exhibit stronger convergence guarantees than SGD under the $(L_0,L_1)$-smoothness assumption, a property observed in many DL tasks. However, the high-probability convergence of Clip-SGD under both assumptions (heavy-tailed noise and $(L_0,L_1)$-smoothness) has not been fully addressed in the literature. In this paper, we bridge this critical gap by establishing the first high-probability convergence bounds for Clip-SGD applied to convex $(L_0,L_1)$-smooth optimization with heavy-tailed noise. Our analysis extends prior results by recovering known bounds for the deterministic case and the…
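
The clipping step the abstract refers to can be illustrated with a minimal sketch. Everything here is an assumption for illustration rather than the paper's setup: the helper names `clip` and `clipped_sgd`, the toy objective $f(x) = \tfrac{1}{2}\|x\|^2$, the Student-t noise model, and the constant stepsize and threshold values (the paper's analysis prescribes its own parameter choices).

```python
import numpy as np


def clip(g, lam):
    """Rescale g so that its Euclidean norm is at most lam."""
    norm = np.linalg.norm(g)
    if norm <= lam:  # also covers g == 0
        return g
    return (lam / norm) * g


def clipped_sgd(grad_oracle, x0, stepsize, lam, num_steps):
    """Run x <- x - stepsize * clip(stochastic gradient, lam) for num_steps steps."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_steps):
        x = x - stepsize * clip(grad_oracle(x), lam)
    return x


# Toy problem (hypothetical): f(x) = 0.5 * ||x||^2, so the true gradient is x.
rng = np.random.default_rng(0)


def noisy_grad(x):
    # Student-t noise with 1.5 degrees of freedom has zero mean but
    # infinite variance, a simple stand-in for heavy-tailed noise.
    return x + rng.standard_t(df=1.5, size=x.shape)


x_final = clipped_sgd(noisy_grad, x0=np.full(10, 5.0),
                      stepsize=0.01, lam=1.0, num_steps=5000)
print("distance to minimizer:", np.linalg.norm(x_final))
```

Because each update has norm at most `stepsize * lam`, a single extreme noise sample cannot move the iterate far, which is the intuition behind using clipping under heavy-tailed noise.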

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The high-probability bound of Clipped-SGD is derived for the first time for the considered problem.
2. The presented bounds recover the current best result when $L_1 = 0$.

Weaknesses

1. The problem class and the algorithm are both motivated by deep learning, in particular attention models, but the theoretical results are presented only for convex functions. It would be great if the non-convex case could be studied.
2. The considered problem class (for more general nonconvex functions) has been studied in [1], where an optimal in-expectation rate for normalized SGD was derived. A comparison with that work would be helpful.
3. No numerical experiments are presented.

[1] L

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The authors identify a conflict in prior work: to handle $(L_0,L_1)$-smoothness, the clipping threshold $\lambda$ is typically set to a fixed constant, whereas to handle heavy-tailed noise, $\lambda$ needs to grow with the number of iterations $K$. The main contribution of this paper is to bridge this gap, providing a high-probability convergence bound for Clipped-SGD under both conditions simultaneously with a unified clipping threshold strategy. This result also successfully avoids the ex

Weaknesses

- Dependence on $1/\delta$: To establish the high-probability bound, Theorem 1 (case 2) requires the total number of iterations $K = \Omega\left(\frac{(L_1 R_0)^{2+\alpha}}{\delta}\right)$. This polynomial dependence on $1/\delta$ is not standard compared with the $\log(1/\delta)$ dependence in Theorem 1 (case 1). Could the authors comment on whether more advanced probabilistic tools might improve the dependence to $\log(1/\delta)$?
- As noted by the authors in the final section, the pape

Reviewer 03 · Rating 2 · Confidence 3

Strengths

- The presentation connects $(L_0,L_1)$-smoothness with heavy-tailed noise in a single framework and recovers several special cases.
- Technically careful: the proofs are self-contained and the algorithmic template (standard clipping) is simple to implement.
- The organization is very clear.

Weaknesses

- Problem novelty is weak. Both heavy-tailed robustness for SGD (with clipping/truncation) and the $(L_0,L_1)$-smoothness framework have been extensively studied; the paper largely resembles a combination of two well-trodden threads ("A + B"), rather than introducing a new core idea or methodology.
- Topic saturation and maturity. Techniques used (clipping-based potential arguments, tail-sensitive concentration) are standard in this area; the contribution reads as a consol

Taxonomy

Topics: Sparse and Compressive Sensing Techniques · Advanced Bandit Algorithms Research · Risk and Portfolio Optimization

Methods: Stochastic Gradient Descent