Adaptive Extrapolated Proximal Gradient Methods with Variance Reduction for Composite Nonconvex Finite-Sum Minimization
Ganzhao Yuan

TL;DR
This paper introduces adaptive extrapolated proximal gradient methods with variance reduction, achieving optimal iteration complexity for composite nonconvex finite-sum minimization and demonstrating superior empirical performance.
Contribution
It presents the first Lipschitz-free methods, AEPG and AEPG-SPIDER, that attain optimal iteration complexity for this class of problems, incorporating adaptive stepsizes and variance reduction techniques.
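To make the ingredients named above concrete, here is a minimal sketch of an extrapolated proximal gradient loop whose stepsize is driven by accumulated iterate differences rather than by a Lipschitz constant. It is not the paper's AEPG update. The function names, the fixed `momentum` value, the initial scale `v0`, the AdaGrad-norm-style rule `step = 1 / sqrt(acc)`, and the choice of an l1 regularizer are all assumptions made for illustration.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (elementwise shrinkage)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def extrapolated_prox_grad(grad_f, x0, lam, iters=200, v0=1.0, momentum=0.5):
    """Illustrative loop for min_x f(x) + lam * ||x||_1.

    Assumed (hypothetical) stepsize rule: 1 / sqrt(v0^2 + sum of squared
    iterate differences). No Lipschitz constant is supplied.
    """
    x_prev = np.array(x0, dtype=float)
    x = x_prev.copy()
    acc = v0 ** 2  # running sum of squared iterate differences
    for _ in range(iters):
        y = x + momentum * (x - x_prev)      # Nesterov-style extrapolation
        step = 1.0 / np.sqrt(acc)            # stepsize from past iterate differences
        x_next = soft_threshold(y - step * grad_f(y), step * lam)
        acc += float(np.sum((x_next - x) ** 2))
        x_prev, x = x, x_next
    return x

# Toy usage: f(x) = 0.5 * ||x - c||^2, whose minimizer is soft_threshold(c, lam).
c = np.array([2.0, -0.05, 1.0])
x_hat = extrapolated_prox_grad(lambda x: x - c, np.zeros(3), lam=0.1)
```

The toy problem is convex and chosen only because its solution is known in closed form; the paper targets nonconvex composite objectives.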
Findings
Achieves optimal iteration complexity of \(O(N\epsilon^{-2})\) for AEPG.
Achieves iteration complexity of \(O(N+\sqrt{N}\epsilon^{-2})\) for AEPG-SPIDER.
Demonstrates superior empirical performance on sparse phase retrieval and linear eigenvalue problems.
Abstract
This paper proposes AEPG-SPIDER, an Adaptive Extrapolated Proximal Gradient (AEPG) method with variance reduction for minimizing composite nonconvex finite-sum functions. It integrates three acceleration techniques: adaptive stepsizes, Nesterov's extrapolation, and the recursive stochastic path-integrated estimator SPIDER. Unlike existing methods that adjust the stepsize factor using historical gradients, AEPG-SPIDER relies on past iterate differences for its update. While targeting stochastic finite-sum problems, AEPG-SPIDER simplifies to AEPG in the full-batch, non-stochastic setting, which is also of independent interest. To our knowledge, AEPG-SPIDER and AEPG are the first Lipschitz-free methods to achieve optimal iteration complexity for this class of composite minimization problems. Specifically, AEPG achieves the optimal…
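For readers unfamiliar with SPIDER, the recursive estimator mentioned in the abstract can be sketched as follows: every `period` steps it recomputes the full gradient, and in between it corrects the previous estimate with a mini-batch gradient difference between consecutive iterates. This is the generic SPIDER recursion, not the paper's exact AEPG-SPIDER procedure. The least-squares components, the function names, and the parameters `batch_size` and `period` are assumptions chosen for illustration.

```python
import numpy as np

def component_grads(A, b, x, idx):
    """Mean gradient of f_i(x) = 0.5 * (a_i^T x - b_i)^2 over the indices idx."""
    r = A[idx] @ x - b[idx]
    return A[idx].T @ r / len(idx)

def spider_gradients(A, b, xs, batch_size, period, rng):
    """Return SPIDER gradient estimates along a given list of iterates xs."""
    n = A.shape[0]
    all_idx = np.arange(n)
    estimates = []
    v = None
    for t, x in enumerate(xs):
        if t % period == 0:
            # Periodic refresh with the full gradient over all n components.
            v = component_grads(A, b, x, all_idx)
        else:
            # Recursive correction: same mini-batch at the new and old iterate.
            idx = rng.choice(n, size=batch_size, replace=False)
            v = (component_grads(A, b, x, idx)
                 - component_grads(A, b, xs[t - 1], idx)
                 + v)
        estimates.append(v)
    return estimates

# Toy usage with random data and a fixed sequence of iterates.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 3))
b = rng.normal(size=8)
xs = [rng.normal(size=3) for _ in range(5)]
estimates = spider_gradients(A, b, xs, batch_size=2, period=3, rng=rng)
```

When `batch_size` equals the number of components, the recursion telescopes and each estimate equals the full gradient, which matches the abstract's remark that the method reduces to a non-stochastic one in the full-batch setting.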
Peer Reviews
Decision·Submitted to ICLR 2026
The proposed framework unifies adaptive stepsizes, Nesterov extrapolation, and variance reduction. This shows an effort to generalize various optimization components into a single algorithmic structure.
- The method combines several existing techniques (adaptive stepsize + extrapolation + variance reduction) but the motivation for this combination is vague. It is not clear what specific deficiency of prior algorithms is being addressed. The presentation is also dense and difficult to interpret, which hinders accessibility.
- The claimed iteration complexities \(O(N\epsilon^{-2})\) and \(O(N+\sqrt{N}\epsilon^{-2})\) are identical to existing optimal results achieved by prior methods such a
1. The topic of the paper is interesting, since adaptive learning-rate free methods are of particular interest to the machine learning community.
2. The proposed method encompasses various techniques such as Nesterov momentum, AdaGrad type updates and variance reduction, potentially leading to faster algorithms.
3. The proposed method seems to outperform the rest of the methods in most experiments.
1. The quality of the presentation could improve substantially. There are a lot of inconsistencies in the notation and repetitions especially in the appendix. Please see the Questions section for some examples.
2. While many techniques are combined in the paper, the theoretical improvement over existing works in the unconstrained case is only in terms of eliminating a logarithmic term, as described in Remark 3.9.
3. The assumptions of the paper are not discussed enough and are not compared again
+ The combination of adaptive stepsizes (based on iterate differences), Nesterov extrapolation, and variance reduction in a single framework for composite nonconvex problems is novel. The proposed update for the extrapolation parameter \(\sigma^t\) is creative and central to the analysis. Moving away from gradient norms to iterate differences for stepsize adaptation is a clever way to handle the nonsmooth term \(h(x)\).
+ Significant theoretical results in optimal iteration complexity and non-ergodic co
- Assumption 3.1 is a standard but potentially restrictive assumption. While commonly used in existing analysis, its necessity should be discussed since many real-world problems are unconstrained. Assumption 4.5/4/10 require the stepsize to be sufficiently large, which is non-standard and not sufficiently validated in numerical and practical experiments.
- The chosen problems (sparse phase retrieval, linear eigenvalue problems) are standard testbeds that have been used to validate proximal meth
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Sparse and Compressive Sensing Techniques · Markov Chains and Monte Carlo Methods
