SGD with memory: fundamental properties and stochastic acceleration

Dmitry Yarotsky; Maksim Velikanov

arXiv:2410.04228·cs.LG·March 11, 2025

Open Access 3 Reviews

TL;DR

This paper investigates how memory-augmented first-order methods can accelerate mini-batch SGD on quadratic problems with power-law spectra, showing that memory can improve convergence constants and potentially enhance the convergence rate.

Contribution

It introduces a framework for memory-$M$ algorithms, proves their basic properties, and demonstrates how memory-1 methods can improve the convergence constant of SGD and accelerate its convergence.

Findings

01

Stationary stable memory-$M$ algorithms retain the convergence exponent $\xi$ of plain GD.

02

Memory-1 algorithms can reduce the constant $C_L$ arbitrarily while remaining stable.

03

A proposed memory-1 algorithm with a time-dependent schedule improves the convergence rate of SGD.
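The findings above are stated in terms of an exponent $\xi$ and a constant $C_L$ in a power-law loss decay $L_t \sim C_L t^{-\xi}$. As a minimal sketch of how these two quantities can be read off a loss curve, the snippet below fits a straight line to $(\log t, \log L_t)$ by ordinary least squares. The function name `fit_power_law` and the synthetic curve (with $\xi = 0.5$ and $C_L = 3$) are illustrative assumptions, not taken from the paper.

```python
import math


def fit_power_law(losses, t_start=1):
    """Fit L_t ~ C_L * t**(-xi) by least squares on (log t, log L_t).

    losses[i] is interpreted as the loss at step t = t_start + i.
    Returns the pair (xi, C_L). Hypothetical helper for illustration.
    """
    xs = [math.log(t_start + i) for i in range(len(losses))]
    ys = [math.log(value) for value in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                  # slope of the log-log line equals -xi
    intercept = mean_y - slope * mean_x  # intercept equals log C_L
    return -slope, math.exp(intercept)


# Synthetic curve with assumed values xi = 0.5 and C_L = 3.0 (not paper data).
synthetic = [3.0 * t ** -0.5 for t in range(1, 201)]
xi_hat, c_hat = fit_power_law(synthetic)
print(xi_hat, c_hat)
```

On a real training curve the power law typically holds only asymptotically, so one would probably discard the early steps before fitting. In this picture, lowering $C_L$ shifts the log-log line down, while a better exponent makes it steeper.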

Abstract

An important open problem is the theoretically feasible acceleration of mini-batch SGD-type algorithms on quadratic problems with power-law spectrum. In the non-stochastic setting, the optimal exponent $\xi$ in the loss convergence $L_t \sim C_L t^{-\xi}$ is double that in plain GD and is achievable using Heavy Ball (HB) with a suitable schedule; this no longer works in the presence of mini-batch noise. We address this challenge by considering first-order methods with an arbitrary fixed number $M$ of auxiliary velocity vectors (*memory-$M$ algorithms*). We first prove an equivalence between two forms of such algorithms and describe them in terms of suitable characteristic polynomials. Then we develop a general expansion of the loss in terms of signal and noise propagators. Using it, we show that losses of stationary stable memory-$M$ algorithms always retain the exponent $\xi$ of plain…
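To make the setting in the abstract concrete, here is a minimal sketch comparing plain GD with classical Heavy Ball (one auxiliary velocity vector) on a diagonal quadratic whose eigenvalues follow a power law. Several simplifications are assumptions made for illustration and are not the paper's construction: the problem is deterministic (no mini-batch noise), the step size `alpha` and momentum `beta` are constant rather than scheduled, and the spectrum exponent `nu = 1.5`, the number of modes, and the initial point are arbitrary choices.

```python
def power_law_spectrum(n_modes, nu):
    """Eigenvalues lambda_k = k**(-nu), k = 1..n_modes (assumed spectrum)."""
    return [k ** (-nu) for k in range(1, n_modes + 1)]


def quadratic_loss(eigs, w):
    """L(w) = 1/2 * sum_k lambda_k * w_k**2, written in the eigenbasis."""
    return 0.5 * sum(lam * x * x for lam, x in zip(eigs, w))


def run_gd(eigs, w0, alpha, steps):
    """Plain gradient descent: w <- w - alpha * grad L(w)."""
    w = list(w0)
    for _ in range(steps):
        w = [x - alpha * lam * x for lam, x in zip(eigs, w)]
    return w


def run_heavy_ball(eigs, w0, alpha, beta, steps):
    """Heavy Ball with one velocity vector v:
    v <- beta * v - alpha * grad L(w);  w <- w + v.
    """
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        v = [beta * vi - alpha * lam * x for lam, x, vi in zip(eigs, w, v)]
        w = [x + vi for x, vi in zip(w, v)]
    return w


# Illustrative parameters (assumptions, not values from the paper).
eigs = power_law_spectrum(n_modes=200, nu=1.5)
w0 = [1.0] * len(eigs)
initial_loss = quadratic_loss(eigs, w0)
loss_gd = quadratic_loss(eigs, run_gd(eigs, w0, alpha=1.0, steps=200))
loss_hb = quadratic_loss(
    eigs, run_heavy_ball(eigs, w0, alpha=1.0, beta=0.9, steps=200)
)
print(initial_loss, loss_gd, loss_hb)
```

With constant `beta`, Heavy Ball acts on the slow (small-eigenvalue) modes roughly like GD with the larger effective step size `alpha / (1 - beta)`, so it reaches a lower loss in the same number of steps but is not expected to change the exponent. The doubled exponent mentioned in the abstract relies on a suitable schedule, which this sketch does not attempt to reproduce.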

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 8·Confidence 2

Strengths

Overall, this is a very good paper. I think it is very methodical in its development of the memory-M framework: it starts with the existing issues of accelerated HB for SGD, then moves to the description of the generalized framework for GD and SGD, the loss decomposition, and the improved algorithm and its convergence properties. The topic coverage is very comprehensive, covering the questions one would expect to arise with the development of the generalized framework. There are clearly a lot of efforts made towards the bala

Weaknesses

An expected consequence of the comprehensiveness of the paper is its density - making it relatively hard to read. As mentioned before, there was clearly a lot of effort put towards reorganization and moving details into the appendix - but I wonder if more could be done in that direction. For example, the Loss expansion section with the introduction of propagators seemed too dense - and was lacking intuitions that would facilitate further reading of the paper, since those concepts were heavily re

Reviewer 02·Rating 8·Confidence 3

Strengths

The paper's biggest strength is the establishment of strong convergence rates for iterative methods. Mini-batched algorithms are difficult to analyze; hence, establishing the convergence rate for regression in the noiseless regime is interesting. Additionally, the method allows for controlling the constant and shows that it can also be arbitrarily improved. I think both of these are important contributions. In addition, the paper introduces a general framework for analyzing such methods. i

Weaknesses

The writing of the paper is a major weakness. It is very dense notationally, though not necessarily conceptually. This makes it quite challenging to build intuition as one reads the paper. One possible suggestion is that the paper develop the results initially in the $a = 0$ case, as the shifts are not needed for the main contributions, and then provide extensions, possibly in the appendix. The other is to hide some of the intermediate steps to provide space for more intuition for the different objects. On

Reviewer 03·Rating 3·Confidence 2

Strengths

The work is definitely interesting to read, and the results seem to hold a lot of promise.

Weaknesses

- Introduction: A reader who is not an expert in this field will find the introduction difficult to understand. I recommend that it be significantly reworked using helpful devices such as tables. In general, the text is difficult to read.
- I don't fully understand the experimental part. Is it just not there?

Taxonomy

Topics·Advanced Thermodynamics and Statistical Mechanics

Methods·Stochastic Gradient Descent