Abstract
We study first-order methods for convex optimization problems with functions $f$ satisfying the recently proposed $\ell$-smoothness condition, which generalizes $L$-smoothness and $(L_0, L_1)$-smoothness. While accelerated gradient descent (AGD) is known to reach the optimal complexity $O(\sqrt{L} R / \sqrt{\varepsilon})$ under $L$-smoothness, where $\varepsilon$ is an error tolerance and $R$ is the distance between a starting and an optimal point, existing extensions to $\ell$-smoothness either incur extra dependence on the initial gradient, suffer exponential factors in $L_1 R$, or require costly auxiliary sub-routines, leaving open whether an AGD-type $O(\sqrt{\ell(0)} R / \sqrt{\varepsilon})$ rate is possible for small $\varepsilon$, even in the $(L_0, L_1)$-smoothness case. We resolve this open question.…
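For orientation, the smoothness conditions named in the abstract are commonly stated as follows in the generalized-smoothness literature. This is a sketch for twice-differentiable $f$ and a non-decreasing function $\ell$; the paper's exact definition may differ (for instance, it may be stated without second derivatives), so the form below is an assumption rather than a quotation.

$$
\|\nabla^2 f(x)\| \le \ell\big(\|\nabla f(x)\|\big) \quad \text{for all } x \quad (\ell\text{-smoothness}),
$$

$$
\ell(s) \equiv L \;\; (L\text{-smoothness}), \qquad \ell(s) = L_0 + L_1 s \;\; ((L_0, L_1)\text{-smoothness}).
$$

Under the second choice, $\ell(0) = L_0$, so a rate that depends on $\ell(0)$ depends only on $L_0$ and not on the size of the gradient at the starting point.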
Peer Reviews
Decision: Submitted to ICLR 2026
The topic is relevant to the research community, and the paper is well-structured and well-written. To the best of my knowledge, the related literature is appropriately covered, and significant parts of the technical approach are novel. The contribution is significant for both theory and practice, since it helps delineate the reach/limitations of classical methods under generalized smoothness, and can provide practitioners in, e.g., scientific computing fields, with potentially improved tools.
1. **Technical approach**
   * Assumption 2.3 states that "$ f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\} $ [...] attains its minimum at a (non-unique) $x^\ast \in \mathbb{R}^d $ [...]". However, none of the motivating examples in line 051 satisfy this over $\mathbb{R}^d$ (for the case of $x^p$, consider odd $p$).
   * The method addresses constrained optimization, yet the proof of Lemma B.3 uses the result of Lemma B.1 by replacing $\nabla f(y)$ with $\nabla f(x^\star)$, which is set to zero. In a constrained problem, the gradient at the minimizer need not vanish.
The primary strength of the paper is that it improves the previous $\ell(\|\nabla f(x^0)\|)$ dependence to $\ell(0)$ (up to considerations of additive terms, discussed below in "Weaknesses"), which helps place the result in the context of the classic lower bound for smooth convex optimization.
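To give a sense of how large the gap between $\ell(\|\nabla f(x^0)\|)$ and $\ell(0)$ can be, here is a minimal Python sketch. It is not taken from the paper: the test function $f(x) = \cosh(x)$, the choice $\ell(s) = 1 + s$, and the starting point `x0 = 10.0` are all illustrative assumptions. Since $\cosh(x) = \sqrt{1 + \sinh^2(x)} \le 1 + |\sinh(x)|$, this $f$ satisfies the condition $|f''(x)| \le \ell(|f'(x)|)$ with $L_0 = L_1 = 1$.

```python
import math


def ell(s, L0=1.0, L1=1.0):
    """Assumed (L0, L1)-type bound ell(s) = L0 + L1 * s (illustrative only)."""
    return L0 + L1 * s


def grad_f(x):
    """Derivative of the illustrative function f(x) = cosh(x)."""
    return math.sinh(x)


def hess_f(x):
    """Second derivative of f(x) = cosh(x)."""
    return math.cosh(x)


# Numerically check |f''(x)| <= ell(|f'(x)|) on a grid over [-10, 10].
grid = [i / 10.0 for i in range(-100, 101)]
condition_holds = all(hess_f(x) <= ell(abs(grad_f(x))) + 1e-9 for x in grid)

# Compare the smoothness bound at a hypothetical starting point with ell(0).
x0 = 10.0
ell_at_start = ell(abs(grad_f(x0)))
ell_at_zero = ell(0.0)

# If a rate scales with the square root of the smoothness bound,
# this ratio is the factor separating the two dependencies.
sqrt_ratio = math.sqrt(ell_at_start / ell_at_zero)

print("condition holds on grid:", condition_holds)
print("ell(|f'(x0)|):", ell_at_start)
print("ell(0):", ell_at_zero)
print("sqrt ratio:", sqrt_ratio)
```

In this toy setting the bound at the starting point is several orders of magnitude larger than $\ell(0)$, which is the kind of gap the reviewer's remark refers to. How this translates into iteration counts depends on the additive terms the reviewer discusses.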
One issue is that the algorithm requires $\Gamma_0$ and $\bar{R}$. Do the other algorithms in Table 1 require these? If not, then the results are not directly comparable, and it would be important to explain these caveats as an additional part of the table. The authors claim (erroneously) that the complexity is optimal (line 250). The authors should specify the range of $\varepsilon$ where they claim optimality, and should emphasize the additive term which prevents the bound from actually being optimal. Th
The authors propose an algorithm that establishes the best-known oracle complexity in the small-$\epsilon$ regime using a tailored Lyapunov function. The results align with the optimal complexity under the $l$-smoothness condition and are empirically validated on a toy problem. The proof sketch is properly presented, and Table 1 clearly exhibits the contribution.
To be honest, I’m uncertain that the paper’s contribution meets the bar for acceptance at this venue. While the paper establishes a best-known bound, the guarantee is confined to the small $\epsilon$ regime, and the constant factor improvement over Li et al. (2024a) seems somewhat incremental.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Optimization and Variational Analysis
