Dimension-adapted Momentum Outscales SGD

Damien Ferbach; Katie Everett; Gauthier Gidel; Elliot Paquette; Courtney Paquette

arXiv:2505.16098·stat.ML·May 23, 2025


TL;DR

This paper introduces DANA (dimension-adapted Nesterov acceleration), a momentum method whose hyperparameters are scaled with model size and data complexity. DANA improves on the scaling-law exponents that SGD and traditional SGD with momentum share, across a broad range of data and target complexities.

Contribution

The paper proposes DANA, a dimension-adapted momentum method that improves loss scaling exponents and compute-optimal scaling relative to SGD and traditional SGD with momentum (SGD-M).

Findings

01. DANA improves loss scaling exponents across data complexities.

02. Theoretical analysis matches experiments on synthetic and real data.

03. DANA outperforms SGD in high-dimensional LSTM training.

Abstract

We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions…
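To make the setting concrete, here is a minimal sketch of a power-law random features problem trained with one-pass, small-batch SGD with classical momentum. It is an illustration under stated assumptions, not the authors' code: the specific spectrum (feature scales `j**-alpha`, target coefficients `j**-beta`), the Gaussian projection `W`, and all function names are hypothetical choices made here. The abstract does not give DANA's momentum schedule, so the sketch implements only the SGD-M baseline with fixed hyperparameters.

```python
import numpy as np

def make_plrf(v, d, alpha, beta, rng):
    # Assumed setup: v latent features with std j^(-alpha) (data complexity),
    # target coefficients j^(-beta) (target complexity), d model parameters.
    j = np.arange(1, v + 1, dtype=float)
    scales = j ** (-alpha)
    target = j ** (-beta)
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(v, d))  # random projection
    return scales, target, W

def population_loss(theta, scales, target, W):
    # E[(x . (W theta - target))^2] for x ~ N(0, diag(scales^2))
    resid = W @ theta - target
    return float(np.sum((scales * resid) ** 2))

def sgd_momentum(scales, target, W, steps, lr, momentum, rng, batch=1):
    # One-pass SGD with classical (heavy-ball style) momentum; fresh samples each step.
    d = W.shape[1]
    theta = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(steps):
        x = rng.normal(size=(batch, scales.size)) * scales  # (batch, v)
        feats = x @ W                                        # (batch, d)
        err = feats @ theta - x @ target                     # (batch,)
        grad = feats.T @ err / batch                         # grad of 0.5 * mean(err^2)
        velocity = momentum * velocity + grad
        theta = theta - lr * velocity
    return theta

rng = np.random.default_rng(0)
scales, target, W = make_plrf(v=200, d=50, alpha=1.0, beta=0.5, rng=rng)
loss_init = population_loss(np.zeros(50), scales, target, W)
theta_sgd = sgd_momentum(scales, target, W, 2000, lr=0.1, momentum=0.0, rng=rng)
theta_mom = sgd_momentum(scales, target, W, 2000, lr=0.01, momentum=0.9, rng=rng)
loss_sgd = population_loss(theta_sgd, scales, target, W)
loss_mom = population_loss(theta_mom, scales, target, W)
print(loss_init, loss_sgd, loss_mom)
```

The two runs use the same effective step size, `lr / (1 - momentum)`, which is the usual way to compare fixed-hyperparameter momentum with plain SGD. Per the abstract, DANA differs by making the momentum hyperparameters depend on model size and data complexity; that schedule would replace the constant `lr` and `momentum` arguments here.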

Videos

Dimension-adapted Momentum Outscales SGD · SlidesLive

Taxonomy

Methods: Stochastic Gradient Descent