Nesterov acceleration in benignly non-convex landscapes

Kanan Gupta; Stephan Wojtowytsch

arXiv:2410.08395·math.OC·May 14, 2025

TL;DR

This paper extends the convergence theory of Nesterov acceleration to the benignly non-convex landscapes common in deep learning, showing that guarantees virtually identical to those of the convex setting still hold, including for stochastic variants.

Contribution

It demonstrates that Nesterov acceleration guarantees hold in benign non-convex landscapes, bridging the gap between theory and deep learning practice.

Findings

01

In benignly non-convex landscapes, Nesterov acceleration achieves guarantees virtually identical to those known for the convex and strongly convex settings.

02

The results apply to both continuous and discrete time models of NAG.

03

Stochastic NAG with additive and multiplicative noise also benefits from these guarantees.

Abstract

While momentum-based optimization algorithms are commonly used in the notoriously non-convex optimization problems of deep learning, their analysis has historically been restricted to the convex and strongly convex setting. In this article, we partially close this gap between theory and practice and demonstrate that virtually identical guarantees can be obtained in optimization problems with a 'benign' non-convexity. We show that these weaker geometric assumptions are well justified in overparametrized deep learning, at least locally. Variations of this result are obtained for a continuous time model of Nesterov's accelerated gradient descent algorithm (NAG), the classical discrete time version of NAG, and versions of NAG with stochastic gradient estimates with purely additive noise and with noise that exhibits both additive and multiplicative scaling.
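To make the objects named in the abstract concrete, here is a minimal sketch of discrete-time NAG with constant momentum and an optional noisy gradient estimate. It is an illustration, not the paper's exact scheme: the function name `nag`, the parameters `noise_add` and `noise_mult`, the Gaussian noise model, and the quadratic test objective are all assumptions introduced here. The step size and momentum are the textbook choices for a strongly convex quadratic; the paper's parameter choices under its weaker geometric assumptions may differ.

```python
import numpy as np


def nag(grad, x0, step, momentum, n_steps, noise_add=0.0, noise_mult=0.0, seed=0):
    """Nesterov's accelerated gradient descent with constant momentum.

    The gradient estimate is grad(y) plus Gaussian noise whose standard
    deviation is noise_add + noise_mult * ||grad(y)||. This noise model is
    a hypothetical stand-in for additive and multiplicative noise scaling.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    x_prev = x.copy()
    for _ in range(n_steps):
        # Look-ahead point: extrapolate along the previous displacement.
        y = x + momentum * (x - x_prev)
        g = grad(y)
        # Perturb the gradient; with both noise levels at 0 this is plain NAG.
        scale = noise_add + noise_mult * np.linalg.norm(g)
        g = g + scale * rng.standard_normal(g.shape)
        # Gradient step taken from the look-ahead point.
        x_prev, x = x, y - step * g
    return x


# Illustrative objective (an assumption): f(x) = 0.5 * x^T A x.
A = np.diag([1.0, 100.0])
mu, L = 1.0, 100.0  # smallest and largest eigenvalues of A


def grad_f(x):
    return A @ x


step = 1.0 / L
momentum = (np.sqrt(L / mu) - 1.0) / (np.sqrt(L / mu) + 1.0)

x_det = nag(grad_f, [1.0, 1.0], step, momentum, n_steps=300)
x_noisy = nag(grad_f, [1.0, 1.0], step, momentum, n_steps=300, noise_add=0.1)
```

With noise switched off, the iterates contract toward the minimizer at the origin; with purely additive noise of fixed size, they settle into a small neighborhood of it rather than converging exactly.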

Peer Reviews

Decision·ICLR 2025 Spotlight

Reviewer 01 · Rating 8 · Confidence 3

Strengths

1. The paper is very clearly written. Previous work is thoroughly explained, and the authors are very straightforward when describing their novel contribution (Lines 42-51 and Lines 106-107).
2. The convergence rates are a clear improvement upon previous work, which either achieves similar rates under stronger assumptions or achieves a non-accelerated rate for gradient flow in a setting similar to this work (aiming condition).
3. The authors are straightforward about the limitations of their work,

Weaknesses

1. The assumptions are inspired by overparameterized deep learning, but are unlikely to be completely accurate. The authors already comment on one instance of this (Lines 253-260). Another instance is the requirement that the objective is $C^1$-smooth (Line 114), which is not satisfied by non-smooth activation functions. Still, the assumptions are weaker than those of previous work, and I believe these gaps are very non-trivial to address. This is a minor weakness compared to the strengths of the paper.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

1. Overall, I think this is a nice paper with interesting contributions to the optimization community. I like the writing style of this paper, especially the concise introduction and the discussion of relevant works.
2. This paper justifies the superiority of momentum methods under weaker assumptions than strong convexity. Moreover, the assumptions made in this paper are supported by empirical evidence.

Weaknesses

While I did not check all the proofs, the paper does not seem to have many novel technical contributions compared with existing works on ODE modeling of optimization algorithms. I'm not sure if this should be considered a weakness, since the results themselves are interesting.

Reviewer 03 · Rating 5 · Confidence 3

Strengths

Some related works are reviewed, both from the line of momentum algorithms and from the geometry of deep neural networks. The paper is well-organized.

Weaknesses

1. To the best of my knowledge, from both theoretical and practical perspectives, the momentum methods commonly used in deep learning are based on Polyak's heavy-ball method, not Nesterov acceleration. However, the paper under review focuses on Nesterov acceleration, while citing motivation from deep learning without any references. Could the authors provide sufficient references to support this motivation?
2. The main contributions of the paper in comparison with related re

Videos

Nesterov acceleration in benignly non-convex landscapes · SlidesLive

Taxonomy

Topics: Mathematical Biology Tumor Growth · Advanced Thermodynamics and Statistical Mechanics