NAG-GS: Semi-Implicit, Accelerated and Robust Stochastic Optimizer

Valentin Leplat; Daniil Merkulov; Aleksandr Katrutsa; Daniel Bershatsky; Olga Tsymboi; Ivan Oseledets

arXiv:2209.14937 · math.OC · October 3, 2023 · 1 citation

TL;DR

This paper introduces NAG-GS, a novel stochastic optimizer combining an accelerated Nesterov-like SDE with semi-implicit discretization, offering improved stability and convergence for training deep learning models.

Contribution

The paper proposes NAG-GS, a robust and accelerated stochastic optimizer based on a new SDE discretization, with theoretical convergence analysis and practical competitiveness.

Findings

01 NAG-GS achieves optimal convergence rates for quadratic functions.

02 NAG-GS is competitive with state-of-the-art optimizers like AdamW.

03 The method demonstrates stability and efficiency across various deep learning tasks.

Abstract

Classical machine learning models such as deep neural networks are usually trained by using Stochastic Gradient Descent-based (SGD) algorithms. The classical SGD can be interpreted as a discretization of the stochastic gradient flow. In this paper we propose a novel, robust and accelerated stochastic optimizer that relies on two key elements: (1) an accelerated Nesterov-like Stochastic Differential Equation (SDE) and (2) its semi-implicit Gauss-Seidel type discretization. The convergence and stability of the obtained method, referred to as NAG-GS, are first studied extensively in the case of the minimization of a quadratic function. This analysis allows us to come up with an optimal learning rate in terms of the convergence rate while ensuring the stability of NAG-GS. This is achieved by the careful analysis of the spectral radius of the iteration matrix and the covariance matrix at…
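The two ingredients named in the abstract can be illustrated on a toy quadratic. The sketch below is not the authors' implementation. It assumes a Nesterov-type flow of the form x' = v - x, γ v' = μ(x - v) - ∇f(x), γ' = μ - γ, and discretizes it in Gauss-Seidel fashion: x is updated first, and the new x is then used implicitly in the v update. The function name `nag_gs_quadratic`, the initialization γ = μ, the additive Gaussian gradient noise, and the step size used in the usage note are all assumptions made for illustration; the learning rate derived in the paper may differ.

```python
import numpy as np


def nag_gs_quadratic(A, b, mu, alpha, steps, noise_std=0.0, seed=0):
    """Semi-implicit Gauss-Seidel steps on f(x) = 0.5 * x^T A x - b^T x.

    Illustrative sketch only: the flow, the initialization, and the noise
    model are assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n = b.shape[0]
    x = np.zeros(n)
    v = np.zeros(n)
    gamma = mu  # assumption: with gamma_0 = mu, gamma stays equal to mu

    for _ in range(steps):
        # (x_new - x) / alpha = v - x_new, solved for x_new.
        x = (x + alpha * v) / (1.0 + alpha)

        # (gamma_new - gamma) / alpha = mu - gamma_new, solved for gamma_new.
        gamma = (gamma + alpha * mu) / (1.0 + alpha)

        # Noisy gradient evaluated at the freshly updated x (Gauss-Seidel).
        grad = A @ x - b + noise_std * rng.standard_normal(n)

        # gamma * (v_new - v) / alpha = mu * (x - v_new) - grad, solved for v_new.
        v = (gamma * v + alpha * mu * x - alpha * grad) / (gamma + alpha * mu)

    return x
```

For example, with `A = np.diag([1.0, 10.0])`, `b = np.ones(2)`, `mu = 1.0` and the assumed step size `alpha = np.sqrt(mu / L)` with `L = 10.0`, the noise-free iterates approach the minimizer `np.linalg.solve(A, b)`. Because both updates are solved for the new iterate rather than extrapolated from the old one, each step is a convex-combination-style average, which is the source of the stability the abstract refers to. Note that the update needs μ explicitly, the point Reviewer 01 raises.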

Peer Reviews

Decision·Submitted to ICLR 2024

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

The paper is largely well written with minor polishing still required. The method seems sound from a few empirical and numerical experiments conducted by the authors, and achieves competitive performance.

Weaknesses

The main weakness is that the convergence analysis covers only the quadratic case, and the convergence result itself, as stated in Theorem 1, is a weak statement with only asymptotic convergence. I did not go through the entire proof, but I can understand that the analysis for general f (replacing Ax with grad f in the method) is non-trivial, since the "lifted" spectrum needs to be bounded effectively. It seems one also requires prior knowledge of \mu for the algorithm, which is very limiting, and the c

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- NAG-GS is derived from an accelerated Stochastic Differential Equation (SDE) using its semi-implicit Gauss-Seidel type discretization, which is interesting.
- The convergence analysis for the quadratic case is comprehensive.

Weaknesses

- The comparison of NAG-GS with other similar methods is insufficient. For example, is NAG-GS faster than Polyak's momentum method for solving quadratic objectives? How does it compare with other variants of NAG, such as the Triple momentum method [1, 2] and ITEM [3]? It is not clear what the key benefit of NAG-GS is in its original setting.
- The improvement in the neural network experiments seems marginal. No deviation statistics are provided in the empirical results.
- A minor point: As one of

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1- Connecting theoretical findings with practical applications and implementations.
2- One step toward practical implementations through considering stochastic extension of prior art in [Luo & Chen (2021)].
3- Text is smooth and easy to read.

Weaknesses

1- Related work is very limited. The idea of analyzing accelerated methods through their continuous-time perspective has been around for quite some time (since [Su et al. (2014)], or even before that in [Alvarez & Attouch (2001)]), and most of the related works mentioned deal with interpretations of deterministic methods. It makes more sense to focus on works that view stochastic accelerated methods through the lens of ODEs, since these are more closely related to the proposed research.
2- The theoretical a

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Advanced Neural Network Applications

Methods: Weight Decay · Logistic Regression · Stochastic Gradient Descent · AdamW