TL;DR
This paper investigates how gradient correlation influences the acceleration of stochastic gradient descent with momentum, providing theoretical insights and empirical validation in convex and neural network settings.
Contribution
It shows that the average correlation between gradients can be used to verify the strong growth condition, the key ingredient for proving accelerated convergence of SNAG.
Findings
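The average correlation between gradients can be used to verify the strong growth condition, which SNAG needs in order to accelerate.
Under the strong growth condition, SNAG converges faster than SGD in the convex setting.
Numerical experiments in linear regression and deep neural network optimization are consistent with the theoretical results.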
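The link between gradient correlation and the strong growth condition (SGC) can be sketched with a short derivation. This is an illustrative sketch under stated assumptions, not the paper's exact statement: it assumes the normalization $f = \frac{1}{N}\sum_{i=1}^N f_i$, a batch size of 1 with the index $i$ drawn uniformly, and a correlation constant $c > -1/2$ introduced here for illustration.

```latex
% Strong growth condition with constant \rho (batch size 1, i uniform on {1,...,N}):
\[
\mathbb{E}_i\big[\|\nabla f_i(x)\|^2\big] \le \rho\,\|\nabla f(x)\|^2 .
\]
% Expand the squared norm of the full gradient, f = (1/N) \sum_i f_i:
\[
\|\nabla f(x)\|^2
= \frac{1}{N^2}\Big(\sum_{i=1}^N \|\nabla f_i(x)\|^2
+ 2\sum_{1\le i<j\le N}\langle \nabla f_i(x), \nabla f_j(x)\rangle\Big).
\]
% Assumed correlation bound: \sum_{i<j} <grad f_i, grad f_j> >= c \sum_i ||grad f_i||^2, with c > -1/2.
\[
\|\nabla f(x)\|^2
\ge \frac{1+2c}{N^2}\sum_{i=1}^N \|\nabla f_i(x)\|^2
= \frac{1+2c}{N}\,\mathbb{E}_i\big[\|\nabla f_i(x)\|^2\big],
\]
% so the strong growth condition holds with
\[
\rho = \frac{N}{1+2c}.
\]
```

Under these assumptions, pairwise orthogonal gradients ($c = 0$) give $\rho = N$, while identical gradients ($c = (N-1)/2$) give $\rho = 1$: the more positively correlated the gradients, the smaller the SGC constant.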
Abstract
Empirically, adding momentum to Stochastic Gradient Descent (SGD) has been observed to accelerate convergence. However, the literature has been rather pessimistic, even for convex functions, about the possibility of proving this observation theoretically. We investigate whether the Stochastic Nesterov Accelerated Gradient (SNAG), a momentum-based version of SGD, can achieve accelerated convergence when minimizing a sum of functions in a convex setting. We demonstrate that the average correlation between gradients makes it possible to verify the strong growth condition, which is the key ingredient for obtaining acceleration with SNAG. Numerical experiments, in both linear regression and deep neural network optimization, confirm our theoretical results in practice.
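To make the quantities in the abstract concrete, the following minimal sketch measures, at a single point, the empirical strong growth ratio and an average gradient-correlation ratio for a least-squares problem. The function names, the loss $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$, the normalization $f = \frac{1}{N}\sum_i f_i$, and the batch size of 1 are assumptions made for illustration; they are not taken from the paper's code. With these choices the two ratios satisfy `sgc_ratio = N / (1 + 2 * correlation_ratio)` wherever the full gradient is nonzero.

```python
import numpy as np


def per_sample_gradients(A, b, x):
    """Gradients of f_i(x) = 0.5 * (a_i @ x - b_i)**2, one row per sample."""
    residuals = A @ x - b            # shape (N,)
    return residuals[:, None] * A    # shape (N, d)


def sgc_and_correlation(grads):
    """Return (sgc_ratio, correlation_ratio) for f = mean of f_i, batch size 1.

    sgc_ratio         = E_i ||g_i||^2 / ||grad f||^2
    correlation_ratio = sum_{i<j} <g_i, g_j> / sum_i ||g_i||^2
    Assumes the full gradient is nonzero at the evaluation point.
    """
    n = grads.shape[0]
    sq_norms = np.sum(grads ** 2)                    # sum_i ||g_i||^2
    full_grad = grads.mean(axis=0)                   # grad f
    sum_all = np.sum(grads.sum(axis=0) ** 2)         # ||sum_i g_i||^2
    pairwise = 0.5 * (sum_all - sq_norms)            # sum_{i<j} <g_i, g_j>
    sgc_ratio = (sq_norms / n) / np.sum(full_grad ** 2)
    correlation_ratio = pairwise / sq_norms
    return sgc_ratio, correlation_ratio


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(50, 200))   # hypothetical overparametrized data: d > N
    b = rng.normal(size=50)
    x = rng.normal(size=200)
    rho_hat, c_hat = sgc_and_correlation(per_sample_gradients(A, b, x))
    print("empirical SGC ratio:", rho_hat)
    print("correlation ratio:  ", c_hat)
    print("N / (1 + 2c):       ", 50 / (1 + 2 * c_hat))
```

Evaluating these ratios along an optimization trajectory, rather than at one random point, would give a rough picture of how favorable the strong growth constant is where the algorithm actually travels.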
Peer Reviews
Decision: ICLR 2025 Poster
1. Originality:
- Proposes the hypothesis that Stochastic Nesterov Accelerated Gradient (SNAG) can accelerate over Stochastic Gradient Descent (SGD) and proves that this hypothesis holds when SNAG is run under the Strong Growth Condition (SGC).
- Provides new asymptotic almost sure convergence results for SNAG.
- Gives a new characterization of the SGC constant using the correlation between gradients.
- Introduces a new condition named Relaxed Averaged COrrelated Gradient Assumption (RACOGA).
2. Q
* The text and formulas are a bit dense; the authors could add a table comparing the convergence speed of SGD and SNAG under different conditions.
* The graphs look good. However, it would be better if the authors explained the graphs in more detail, for example, what the "small values" of RACOGA mean on the graph.
* The colors in the right graph of Figure 1(a) are similar; the authors could use more contrasting colors.
Previous works have shown that stochastic versions of NAG converge at the same accelerated rates when the gradient estimates satisfy the strong growth condition (SGC). While they provide heuristics that suggest that SGC is a reasonable assumption in the context of overparametrized deep learning, it is not always clear when the condition is actually satisfied. This work addresses that gap in the literature. The authors show that for functions of the form $f=\sum_{i=1}^N f_i$, positive gradient co
1. Line 70: "However, even the question of the possibility to accelerate with SNAG in the convex setting is not solved yet." This is either unclear or inaccurate or both. There are several works which address the convergence of accelerated methods in the stochastic setting, both under SGC and with classical Robbins-Monro bounds, at least for smooth objectives. For a rigorous statement, the authors should specify the geometric assumptions, smoothness assumptions, and assumptions on the gradient o
1. The paper is well-organized and easy to read. The material in the supplement serves as a good complement to the main paper. Experimental results are presented clearly, with nice plots and great detail.
2. The main result is indeed very interesting to the community and gives some insight into a long-standing question. The theoretical contribution mainly comes from Theorem 4, which provides an almost sure convergence result for SNAG showing a speed-up compared to SGD.
3. By proposing a new
1. The proof of Theorem 4 relies heavily on an existing result (Sebbouh et al. 2021, Theorem 9), which one could argue weakens the theoretical contribution of this work.
2. I appreciate that the authors made an effort to compare RACOGA with gradient diversity and gradient confusion, and I agree with the authors that they are not identical, but they do look quite similar.
