TL;DR
This paper introduces DANA, a dimension-adapted Nesterov acceleration method that improves the scaling-law exponents of stochastic momentum algorithms by scaling the momentum hyperparameters with model size and data complexity. It outscales both SGD and SGD with momentum across a broad range of data and target complexities.
Contribution
The paper proposes DANA, a dimension-adapted momentum method that improves loss scaling exponents and compute-optimal scaling relative to SGD with momentum.
Findings
DANA improves loss scaling exponents across data complexities.
Theoretical analysis matches experiments on synthetic and real data.
DANA outperforms SGD in high-dimensional LSTM training.
Abstract
We investigate scaling laws for stochastic momentum algorithms with small batch on the power law random features model, parameterized by data complexity, target complexity, and model size. When trained with a stochastic momentum algorithm, our analysis reveals four distinct loss curve shapes determined by varying data-target complexities. While traditional stochastic gradient descent with momentum (SGD-M) yields identical scaling law exponents to SGD, dimension-adapted Nesterov acceleration (DANA) improves these exponents by scaling momentum hyperparameters based on model size and data complexity. This outscaling phenomenon, which also improves compute-optimal scaling behavior, is achieved by DANA across a broad range of data and target complexities, while traditional methods fall short. Extensive experiments on high-dimensional synthetic quadratics validate our theoretical predictions…
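The setting in the abstract can be illustrated with a small sketch. Everything below is an assumption made for illustration rather than the paper's code: the function names, the parameterization (coordinate variances decaying as j^(-2·alpha), target coefficients as j^(-beta), a Gaussian random-feature matrix of width d), and the hyperparameter values in the usage note. It runs plain one-sample SGD with constant heavy-ball momentum, which is the SGD-M baseline; DANA's dimension-dependent momentum schedule is not given in this summary and is not reproduced here.

```python
import numpy as np

def make_plrf(v, d, alpha, beta, rng):
    """Build a hypothetical power-law random features problem.

    Assumed form: inputs x in R^v with independent coordinates of standard
    deviation j^(-alpha), target coefficients j^(-beta), and a random
    feature matrix W of shape (v, d) with N(0, 1/d) entries.
    """
    j = np.arange(1, v + 1, dtype=float)
    scales = j ** (-alpha)   # per-coordinate standard deviation of the data
    target = j ** (-beta)    # ground-truth coefficients
    W = rng.standard_normal((v, d)) / np.sqrt(d)
    return scales, target, W

def population_loss(theta, scales, target, W):
    """Expected squared error 0.5 * E[(<W^T x, theta> - <x, target>)^2]."""
    resid = W @ theta - target
    return 0.5 * np.sum((scales * resid) ** 2)

def sgd_momentum(scales, target, W, steps, lr, momentum, rng):
    """One-sample SGD with constant heavy-ball momentum (SGD-M baseline).

    Setting momentum=0.0 recovers plain SGD.
    """
    v, d = W.shape
    theta = np.zeros(d)
    velocity = np.zeros(d)
    for _ in range(steps):
        x = scales * rng.standard_normal(v)   # fresh sample each step
        feats = W.T @ x                       # random features of x
        err = feats @ theta - x @ target      # prediction error on this sample
        grad = err * feats                    # gradient of 0.5 * err**2
        velocity = momentum * velocity + grad
        theta = theta - lr * velocity
    return theta
```

As a usage example, one could call `make_plrf(200, 50, 1.0, 0.5, rng)` with `rng = np.random.default_rng(0)`, train once with `momentum=0.0` and once with `momentum=0.9` (using a ten times smaller `lr` so the effective step size `lr / (1 - momentum)` matches), and compare `population_loss` over a range of `d`. Constant momentum only rescales the effective learning rate here, which is consistent with the abstract's statement that SGD-M yields the same exponents as SGD.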
Taxonomy
Methods: Stochastic Gradient Descent
