Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis

Zijian Liu

arXiv:2512.23178·math.OC·May 19, 2026

TL;DR

This paper refines the analysis of Clipped SGD for nonsmooth convex optimization under heavy-tailed noise, achieving faster convergence rates and establishing optimality through new lower bounds.

Contribution

It provides improved convergence rates for Clipped SGD using a refined analysis and introduces the concept of generalized effective dimension.

Findings

01. Faster convergence rates under heavy-tailed noise.

02. New lower bounds matching the upper bounds.

03. Enhanced analysis utilizing Freedman's inequality.

Abstract

Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded $p$-th moment where $p \in (1, 2]$ has been recognized to be more realistic (say being upper bounded by $\sigma_{l}^{p}$ for some $\sigma_{l} \geq 0$). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees a high-probability rate $O(\sigma_{l} \ln(1/\delta) T^{1/p - 1})$ (resp. $O(\sigma_{l}^{2} \ln^{2}(1/\delta) T^{2/p - 2})$) for nonsmooth convex (resp. strongly convex) problems, where $\delta \in (0, 1]$ is the failure probability and $T \in \mathbb{N}$ is the time horizon.…
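To make the clipping operation in the abstract concrete, here is a minimal Python sketch of Clipped SGD on a nonsmooth convex objective. Everything specific in it is an assumption for illustration and is not taken from the paper: the function names `clip` and `clipped_sgd`, the constant step size, the clipping threshold `tau`, the use of the average iterate as output, the objective $f(x) = \|x\|_1$, and the Student-t noise with 1.5 degrees of freedom (zero mean, infinite variance, finite $p$-th moment for $p < 1.5$). The paper's own parameter choices and output rule may differ.

```python
import numpy as np

def clip(g, tau):
    """Scale g down so its Euclidean norm is at most tau; direction is kept."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def clipped_sgd(stochastic_subgrad, x0, steps, step_size, tau):
    """Clipped SGD with a constant step size; returns the average iterate."""
    x = np.asarray(x0, dtype=float).copy()
    avg = np.zeros_like(x)
    for t in range(steps):
        g = stochastic_subgrad(x)           # noisy subgradient at x
        x = x - step_size * clip(g, tau)    # each step moves at most step_size * tau
        avg += (x - avg) / (t + 1)          # running mean of the iterates so far
    return avg

# Illustrative problem (an assumption, not from the paper): f(x) = ||x||_1,
# whose subgradient sign(x) is bounded, plus heavy-tailed Student-t noise.
rng = np.random.default_rng(0)

def noisy_l1_subgrad(x):
    return np.sign(x) + rng.standard_t(df=1.5, size=x.shape)

x_bar = clipped_sgd(noisy_l1_subgrad, x0=np.ones(5),
                    steps=2000, step_size=0.01, tau=5.0)
print(np.abs(x_bar).sum())  # objective value at the average iterate
```

Because the clipped step has norm at most `step_size * tau`, a single extreme noise draw cannot throw the iterate arbitrarily far, which is the intuition behind the high-probability guarantees quoted above.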

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. This paper provides a careful analysis that makes better use of Freedman's inequality, leading to refined high-probability and in-expectation bounds.
2. This work generalizes the effective dimension to heavy-tailed robust optimization.

Weaknesses

1. The boundedness of gradients is a strong assumption, given that the motivation for heavy-tailed noise is typically training neural networks.
2. The improvement in the bounds is mostly in terms of constants, and it is not clear whether this is a significant improvement.
3. No experiments are provided. Some simple numerical examples to validate the theory would also be helpful.

Reviewer 02 · Rating 6 · Confidence 3

Strengths

The rates of convergence in Theorems 1 and 2 are tighter compared to [Das et al., NeurIPS 2024], [Gorbunov et al., JOTA 2024], [Liu, Zhou, arXiv:2303.12277, 2023]. The present paper also removes some artefacts. For instance, in [Das et al., NeurIPS 2024] the upper bound blows up when $\sigma_s$ tends to zero.

Weaknesses

1. The upper bound on $\mathbb E_{t - 1} X_t^2 = \mathbb E_{t - 1} \langle d_t^u, y_t\rangle^2$ is quite standard. The fact that $\mathbb E_{t - 1} X_t^2 \leq \| \mathbb E_{t - 1} [ d_t^u (d_t^u)^\top ] \|$ has been used in statistics. I am a bit surprised that it was overlooked in optimization.
2. According to Section 5, the main novelty of the paper is based on the refined bounds on $\| \mathbb E_{t - 1} [d_t^u (d_t^u)^\top]\|$ and $\|d_t^b\|$ stated in Lemma 1. Their proofs take
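
The inequality cited in the first weakness above can be checked in one line. This is a sketch under assumptions that the review excerpt does not state: $y_t$ is determined by the history up to step $t - 1$ and satisfies $\|y_t\| \leq 1$, and $\|\cdot\|$ applied to a matrix denotes the operator norm.

$$
\mathbb E_{t - 1} \langle d_t^u, y_t \rangle^2
= y_t^\top \, \mathbb E_{t - 1} [ d_t^u (d_t^u)^\top ] \, y_t
\leq \| \mathbb E_{t - 1} [ d_t^u (d_t^u)^\top ] \| \, \|y_t\|^2
\leq \| \mathbb E_{t - 1} [ d_t^u (d_t^u)^\top ] \|.
$$

For comparison, applying Cauchy-Schwarz directly gives the bound $\mathbb E_{t - 1} \|d_t^u\|^2$, which is the trace of the same matrix and can exceed its operator norm by a factor of up to the dimension.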

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1) Improving convergence rates in both high-probability and in-expectation analyses.
2) Considering a tighter assumption on heavy-tailed noise and introducing a generalized effective dimension $d_{\text{eff}}$.
3) Conducting a tighter analysis using martingale inequalities.

Weaknesses

1) Bounded gradients are quite a restrictive assumption for Clipped SGD.
2) Following the previous point, the lower bound on the number of iterations depends on $G$. Therefore, with exploding gradients, significantly more iterations are required.
3) The paper lacks a performance comparison of Clipped SGD with $\sigma_s \neq \sigma_l$, as well as an evaluation on different datasets.

Videos

Clipped Gradient Methods for Nonsmooth Convex Optimization under Heavy-Tailed Noise: A Refined Analysis · slideslive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Bandit Algorithms Research