Near-Optimal Convergence of Accelerated Gradient Methods under Generalized and $(L_0, L_1)$-Smoothness

Alexander Tyurin

arXiv:2508.06884·math.OC·May 22, 2026

3 Reviews

Abstract

We study first-order methods for convex optimization problems with functions $f$ satisfying the recently proposed $\ell$-smoothness condition $\|\nabla^{2} f(x)\| \leq \ell(\|\nabla f(x)\|)$, which generalizes $L$-smoothness and $(L_{0}, L_{1})$-smoothness. While accelerated gradient descent (AGD) is known to reach the optimal complexity $O(\sqrt{L} R / \sqrt{\varepsilon})$ under $L$-smoothness, where $\varepsilon$ is an error tolerance and $R$ is the distance between a starting point and an optimal point, existing extensions to $\ell$-smoothness either incur extra dependence on the initial gradient, suffer exponential factors in $L_{1} R$, or require costly auxiliary sub-routines, leaving open whether an AGD-type $O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$ rate is possible for small $\varepsilon$, even in the $(L_{0}, L_{1})$-smoothness case. We resolve this open question.…
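As a rough illustration of the smoothness condition in the abstract, the sketch below checks it numerically in one dimension. The test function $f(x) = \cosh(x)$, the grid, and the constants $L_0 = L_1 = 1$ are assumptions chosen for this example and do not come from the paper. For this choice, $\ell(s) = L_0 + L_1 s$, so $\ell(0) = L_0$, and the function is convex with a minimizer at $x = 0$ while its second derivative is unbounded, so no single constant $L$ works globally.

```python
import math

def ell(s, L0=1.0, L1=1.0):
    """(L0, L1)-smoothness is the special case ell(s) = L0 + L1 * s."""
    return L0 + L1 * s

# Illustrative test function (an assumption, not from the paper): f(x) = cosh(x).
def grad(x):
    return math.sinh(x)   # f'(x)

def hess(x):
    return math.cosh(x)   # f''(x)

# Grid on [-20, 20] with step 0.1.
xs = [i / 10.0 for i in range(-200, 201)]

# Check |f''(x)| <= ell(|f'(x)|) on the grid (tiny tolerance for rounding).
holds = all(abs(hess(x)) <= ell(abs(grad(x))) + 1e-12 for x in xs)

# The Hessian itself is not bounded by a modest constant on this grid,
# which is why plain L-smoothness is a poor fit for this function.
max_hess = max(abs(hess(x)) for x in xs)

print("ell-smoothness holds on grid:", holds)
print("largest |f''| on grid:", max_hess)
print("ell(0):", ell(0.0))
```

In this toy case the constant that would enter an $\ell(0)$-type bound is $L_0 = 1$, whereas any global Lipschitz constant for the gradient on the grid would have to be as large as the printed maximum of $|f''|$.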

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The topic is relevant to the research community, and the paper is well-structured and well-written. To the best of my knowledge, the related literature is appropriately covered, and significant parts of the technical approach are novel. The contribution is significant for both theory and practice, since it helps delineate the reach/limitations of classical methods under generalized smoothness, and can provide practitioners in, e.g., scientific computing fields, with potentially improved tools.

Weaknesses

1. **Technical approach**
   * Assumption 2.3 states that "$f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$ [...] attains its minimum at a (non-unique) $x^\ast \in \mathbb{R}^d$ [...]". However, none of the motivating examples in line 051 satisfies this over $\mathbb{R}^d$ (for the case of $x^p$, consider odd $p$).
   * The method addresses constrained optimization, yet the proof of Lemma B.3 uses the result of Lemma B.1 by replacing $\nabla f(y)$ with $\nabla f(x^\star)$, which is set to zero.

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The primary strength of the paper is that it improves the previous $\ell(\|\nabla f(x^0)\|)$ dependence to $\ell(0)$ (up to considerations of additive terms, discussed below in "Weaknesses"), which helps place the result in the context of the classic lower bound in smooth convex optimization.

Weaknesses

One issue is that the algorithm needs $\Gamma_0$ and $\bar{R}$. Do the other algorithms in Table 1 require these? If not, then the results are not directly comparable, and it would be important to explain these caveats as an additional part of the table. The authors claim (erroneously) that the complexity is optimal (line 250). The authors should specify the range of $\varepsilon$ where they claim optimality, and should emphasize the additive term which prevents them from actually being optimal. Th

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The authors propose an algorithm that establishes the best-known oracle complexity in the small-$\epsilon$ regime using a tailored Lyapunov function. The results align with the optimal complexity under the $\ell$-smoothness condition and are empirically validated on a toy problem. The proof sketch is properly presented, and Table 1 clearly shows the contribution.

Weaknesses

To be honest, I’m uncertain that the paper’s contribution meets the bar for acceptance at this venue. While the paper establishes a best-known bound, the guarantee is confined to the small $\epsilon$ regime, and the constant factor improvement over Li et al. (2024a) seems somewhat incremental.

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Optimization and Variational Analysis