Optimizing $(L_0, L_1)$-Smooth Functions by Gradient Methods
Daniil Vankov, Anton Rodomanov, Angelia Nedich, Lalitha Sankar, Sebastian U. Stich

TL;DR
This paper introduces new optimization techniques and complexity bounds for the class of $(L_0, L_1)$-smooth functions, enhancing convergence guarantees for gradient methods in both convex and nonconvex settings.
Contribution
The paper develops a unified framework for analyzing gradient methods on $(L_0, L_1)$-smooth functions, improving complexity bounds and extending accelerated methods to this class.
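For orientation, the function class can be stated as follows. This is the definition commonly used in the generalized-smoothness literature for twice continuously differentiable $f$; it is supplied here as background and is not quoted from the summary above:

$$
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \quad \text{for all } x, \qquad L_0, L_1 \ge 0.
$$

Setting $L_1 = 0$ recovers the usual condition that the gradient is $L_0$-Lipschitz, which is the sense in which the class generalizes Lipschitz-smooth functions.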
Findings
Improved complexity bounds for convex $(L_0, L_1)$-smooth functions.
Nearly optimal complexity guarantees for gradient methods with Polyak stepsizes.
Application of accelerated gradient methods to this function class with enhanced performance.
Abstract
We study gradient methods for optimizing $(L_0, L_1)$-smooth functions, a class that generalizes Lipschitz-smooth functions and has gained attention for its relevance in machine learning. We provide new insights into the structure of this function class and develop a principled framework for analyzing optimization methods in this setting. While our convergence rate estimates recover existing results for minimizing the gradient norm in nonconvex problems, our approach significantly improves the best-known complexity bounds for convex objectives. Moreover, we show that the gradient method with Polyak stepsizes and the normalized gradient method achieve nearly the same complexity guarantees as methods that rely on explicit knowledge of $(L_0, L_1)$. Finally, we demonstrate that a carefully designed accelerated gradient method can be applied to $(L_0, L_1)$-smooth functions, further…
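As a rough illustration of why Polyak stepsizes need no knowledge of the smoothness constants, the sketch below runs the classical Polyak rule $h_k = (f(x_k) - f^*) / \|\nabla f(x_k)\|^2$ on a toy convex objective. The objective $f(x) = \sum_i (e^{x_i} + e^{-x_i})$, the function names, and the stopping tolerances are assumptions chosen for this example and do not come from the paper; the objective is $(L_0, L_1)$-smooth with $L_0 = 2$, $L_1 = 1$ but has no globally Lipschitz gradient.

```python
import math


def f(x):
    # Toy objective (an assumption for this sketch): minimized at x = 0,
    # where the optimal value is f* = 2 * len(x).
    return sum(math.exp(xi) + math.exp(-xi) for xi in x)


def grad_f(x):
    # Gradient of the toy objective, computed coordinate-wise.
    return [math.exp(xi) - math.exp(-xi) for xi in x]


def polyak_gradient_method(f, grad_f, x0, f_star, num_iters=200):
    """Gradient method with the classical Polyak stepsize.

    Only the optimal value f_star is required; neither L_0 nor L_1 is used.
    """
    x = list(x0)
    for _ in range(num_iters):
        g = grad_f(x)
        g_norm_sq = sum(gi * gi for gi in g)
        gap = f(x) - f_star
        # Stop once the gradient or the optimality gap vanishes numerically.
        if g_norm_sq == 0.0 or gap <= 0.0:
            break
        step = gap / g_norm_sq  # Polyak stepsize
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


if __name__ == "__main__":
    x_final = polyak_gradient_method(f, grad_f, [3.0, -2.0], f_star=4.0)
    print("final point:", x_final)
    print("optimality gap:", f(x_final) - 4.0)
```

The rule does require the optimal value $f^*$, which is known in this toy problem but is often unavailable in practice.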
Peer Reviews
Decision·ICLR 2025 Poster
As mentioned in the Summary section, this paper nicely demonstrates the parallelism of how the traditional techniques from smooth (convex) minimization should be properly interpreted and applied to the class of $(L_0, L_1)$-smooth functions. As a reader who is much more familiar with the classical optimization theory but not as much with the recent theory of $(L_0, L_1)$-smooth optimization, the paper was interesting, easily readable, and informative. It seems such viewpoint has not been provide
The primary weakness of the current version of the paper is the writing. While I think the technical contents are good, the writing does not seem to be polished carefully enough and should be managed before the publication. The general flow is okay, but there are many detailed points which I would recommend the authors to address; please refer to the Questions section. I would recommend acceptance of the paper provided that the writing issues get resolved.
1. The paper is well-written and easy to follow. The analysis is thorough and offers many insights for research in optimization. 2. I appreciate the comparison between this paper and Gorbunov et al. (2024), which is comprehensive and particularly crucial for readers. 3. I have attempted to implement some of the suggested algorithms and found that the analysis presented here is essential for practical use, especially when the optimal step size in equation (11) is given explicitly.
The following are some major concerns about the paper. 1. The numerical experiments are conducted only on simple functions. I suggest that the authors significantly improve this section by employing more diverse examples, particularly in nonconvex cases. For example, could the authors consider loss functions from deep learning models or other challenging nonconvex optimization problems from applications? 2. If I have not overlooked anything, the current paper does not discuss the difficulties
To me, one of this paper's main strengths/contributions is the derivation of tighter first-order upper and lower bounds. The bounds will undoubtedly help any further work on analyzing this particular function class. In Theorems 3.2 and 5.1, the authors are able to dispense with the $L$-Lipschitz gradient assumption, which was additionally assumed in the previous work of Koloskova et al. (2023), while improving the existing rate. The overall flow of the paper is well-presented.
1. Numerical results can be strengthened by providing more experiments. 2. For Theorem 3.1, even though your rate is better than the rate provided in Hubler et al 2024, they do not have the dependency on $L_0$ and $L_1$ (See section Questions, 1.). 3. (line 269) I believe "By Theorem 3.1, the smallest number K of iterations required to achieve ..." is incorrect, as one can happen to choose an initial point such that $\| \nabla f(x_{0}) \| < \epsilon_{\mathbf{g}}$. In this case, $K = 0$, whic
Taxonomy
Topics: Advanced Numerical Analysis Techniques · Advanced Optimization Algorithms Research · Optimization and Variational Analysis
