TL;DR
This paper refines the analysis of Clipped SGD for nonsmooth convex optimization under heavy-tailed noise, achieving faster convergence rates and establishing optimality through new lower bounds.
Contribution
It provides improved convergence rates for Clipped SGD using a refined analysis and introduces the concept of generalized effective dimension.
Findings
Faster convergence rates under heavy-tailed noise.
New lower bounds matching the upper bounds.
Enhanced analysis utilizing Freedman's inequality.
Abstract
Optimization under heavy-tailed noise has become popular recently, since it better fits many modern machine learning tasks, as captured by empirical observations. Concretely, instead of a finite second moment on gradient noise, a bounded $p$-th moment with $p \in (1, 2]$ has been recognized to be more realistic (say, upper bounded by a fixed constant). A simple yet effective operation, gradient clipping, is known to handle this new challenge successfully. Specifically, Clipped Stochastic Gradient Descent (Clipped SGD) guarantees high-probability convergence rates for nonsmooth convex and strongly convex problems, stated in terms of the failure probability and the time horizon.…
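As a rough illustration of the clipping operation the abstract refers to, here is a minimal sketch of Clipped SGD on a toy nonsmooth convex objective. Everything specific in it is an assumption made for illustration and is not taken from the paper: the objective $f(x) = \|x\|_1$, the symmetric Pareto-type noise, the constant step size `eta`, the clipping threshold `tau`, and the use of the averaged iterate as output.

```python
import math
import random


def l1(x):
    """Toy nonsmooth convex objective f(x) = ||x||_1 (illustrative choice)."""
    return sum(abs(v) for v in x)


def subgradient_l1(x):
    """One valid subgradient of ||x||_1: the coordinate-wise sign (0 at 0)."""
    return [float((v > 0) - (v < 0)) for v in x]


def clip(g, tau):
    """Rescale g so its Euclidean norm is at most tau, keeping its direction."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= tau:
        return list(g)
    return [v * tau / norm for v in g]


def heavy_tailed_noise(rng, dim, alpha=1.5):
    """Symmetric, zero-mean noise whose p-th moment is finite only for p < alpha."""
    return [rng.choice((-1.0, 1.0)) * (rng.paretovariate(alpha) - 1.0)
            for _ in range(dim)]


def clipped_sgd(x0, steps, eta, tau, seed=0):
    """Run clipped stochastic subgradient steps and return the averaged iterate."""
    rng = random.Random(seed)
    x = list(x0)
    avg = [0.0] * len(x)
    for t in range(1, steps + 1):
        noise = heavy_tailed_noise(rng, len(x))
        # Noisy subgradient: true subgradient plus heavy-tailed noise.
        g = [s + n for s, n in zip(subgradient_l1(x), noise)]
        # Clip before stepping, so each step moves at most eta * tau.
        g = clip(g, tau)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
        # Running mean of the iterates x_1, ..., x_t.
        avg = [a + (xi - a) / t for a, xi in zip(avg, x)]
    return avg


if __name__ == "__main__":
    x0 = [5.0, -3.0]
    x_bar = clipped_sgd(x0, steps=2000, eta=0.05, tau=2.0)
    print("f(x0) =", l1(x0), " f(x_bar) =", l1(x_bar))
```

The constants here are arbitrary. The paper's analysis concerns how the step size and clipping threshold should be chosen, and this sketch does not attempt to reproduce those choices.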
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper provides a careful analysis that makes better use of Freedman's inequality, leading to refined high-probability and in-expectation bounds.
2. This work generalizes the effective dimension to heavy-tailed robust optimization.
1. Bounded gradients are a strong assumption, given that the usual motivation for heavy-tailed noise is training neural networks.
2. The improvement in the bound is mostly in the constants, and it is not clear whether this is significant.
3. No experiments are provided. Some simple numerical examples to validate the theory would be helpful.
The rates of convergence in Theorems 1 and 2 are tighter compared to [Das et al., NeurIPS 2024], [Gorbunov et al., JOTA 2024], [Liu, Zhou, arXiv:2303.12277, 2023]. The present paper also removes some artefacts. For instance, in [Das et al., NeurIPS 2024] the upper bound blows up when $\sigma_s$ tends to zero.
1. The upper bound on $\mathbb E_{t - 1} X_t^2 = \mathbb E_{t - 1} \langle d_t^u, y_t\rangle^2$ is quite standard. The fact that $\mathbb E_{t - 1} X_t^2 \leq \| \mathbb E_{t - 1} [ d_t^u (d_t^u)^\top ] \|$ has been used in statistics. I am a bit surprised that it was overlooked in optimization.
2. According to Section 5, the main novelty of the paper is based on the refined bounds on $\| \mathbb E_{t - 1} [d_t^u (d_t^u)^\top]\|$ and $\|d_t^b\|$ stated in Lemma 1. Their proofs take
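The inequality quoted in point 1 of this review can be checked in one line. This is a sketch under two assumptions that the excerpt does not state: $y_t$ is determined by the history up to step $t-1$ (so it can be pulled out of $\mathbb E_{t-1}$), and $\|y_t\| \le 1$.

```latex
\[
\mathbb{E}_{t-1} X_t^2
  = \mathbb{E}_{t-1}\big[ y_t^\top d_t^u (d_t^u)^\top y_t \big]
  = y_t^\top \, \mathbb{E}_{t-1}\big[ d_t^u (d_t^u)^\top \big] \, y_t
  \le \big\| \mathbb{E}_{t-1}\big[ d_t^u (d_t^u)^\top \big] \big\| \, \|y_t\|^2
  \le \big\| \mathbb{E}_{t-1}\big[ d_t^u (d_t^u)^\top \big] \big\| .
\]
```

The first inequality is the definition of the operator norm applied to a positive semidefinite matrix. The cruder alternative $\mathbb E_{t-1} X_t^2 \le \mathbb E_{t-1} \|d_t^u\|^2 \|y_t\|^2$ replaces the operator norm by the trace of the same matrix, which can be larger by a dimension-dependent factor.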
1) Improving convergence rates in both high-probability and in-expectation analyses.
2) Considering a tighter assumption on heavy-tailed noise and introducing a generalized effective dimension $d_{\text{eff}}$.
3) Conducting a tighter analysis using martingale inequalities.
1) Bounded gradients are quite a restrictive assumption for Clipped SGD.
2) Following the previous point, the lower bound on the number of iterations depends on $G$. Therefore, with exploding gradients, significantly more iterations are needed.
3) The paper lacks a performance comparison of Clipped SGD with $\sigma_s\neq\sigma_l$, as well as an evaluation on different datasets.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Advanced Bandit Algorithms Research
