Flatter, faster: scaling momentum for optimal speedup of SGD

Aditya Cowsik; Tankut Can; Paolo Glorioso

arXiv:2210.16400 · cs.LG · June 14, 2023

TL;DR

This paper introduces a principled way to scale the momentum hyperparameter in SGD, achieving optimal training speedup without sacrificing generalization, by analyzing training dynamics in overparametrized neural networks.

Contribution

It develops an architecture-independent framework that determines how to scale momentum with learning rate for maximal acceleration in SGD training.

Findings

01. Scaling momentum as (1 - β) ∝ (learning rate)^{2/3} maximizes training speedup.

02. The proposed scaling rule is validated on synthetic and real datasets, showing robustness.

03. Training dynamics reveal two characteristic timescales that meet at optimal acceleration.
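The scaling rule in the first finding can be sketched as a small helper. This is an illustration only, not the authors' code: the function name `momentum_from_lr` is hypothetical, and the prefactor `C` is a free parameter that still has to be tuned (the reviews below note that the theory does not prescribe it). The exponent `gamma` defaults to the 2/3 stated above.

```python
def momentum_from_lr(lr: float, C: float = 1.0, gamma: float = 2.0 / 3.0) -> float:
    """Return beta such that 1 - beta = C * lr**gamma.

    `C` is an assumed, problem-dependent prefactor; `gamma` is the
    scaling exponent (2/3 in the rule summarized above).
    """
    if lr <= 0.0:
        raise ValueError("learning rate must be positive")
    one_minus_beta = C * lr ** gamma
    # beta must stay in [0, 1) for a standard momentum update.
    if not 0.0 < one_minus_beta <= 1.0:
        raise ValueError("C * lr**gamma must lie in (0, 1]")
    return 1.0 - one_minus_beta


if __name__ == "__main__":
    # Smaller learning rates push beta closer to 1 under this rule.
    for lr in (1e-1, 1e-2, 1e-3):
        print(f"lr={lr:g}  beta={momentum_from_lr(lr):.4f}")
```

In practice one would presumably fix `C` from a single tuned (learning rate, momentum) pair and then use the rule to move along the learning-rate axis, rather than searching both hyperparameters independently.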

Abstract

Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study training dynamics arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter $1 - β$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is…
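To make the setting in the abstract concrete, here is a minimal sketch of SGD with momentum and label noise on a toy overparametrized least-squares problem. Everything in it is an assumption made for illustration: the function name, the heavy-ball update convention (`v = beta * v - grad`, `w = w + lr * v`), the Gaussian form of the label noise, and the toy data are not taken from the paper. The toy model has more parameters than data points, so its zero-loss solutions form a line rather than a single point, a simple instance of a degenerate set of global minimizers.

```python
import random


def sgd_momentum_label_noise(X, y, lr, beta, noise_std, steps, seed=0):
    """Heavy-ball SGD on squared loss with fresh Gaussian noise added to each label.

    X: list of feature rows, y: list of targets. Returns the final weights.
    """
    rng = random.Random(seed)
    d = len(X[0])
    w = [0.0] * d  # parameters
    v = [0.0] * d  # momentum (velocity) buffer
    for _ in range(steps):
        i = rng.randrange(len(X))  # batch size 1
        noisy_y = y[i] + noise_std * rng.gauss(0.0, 1.0)  # label noise
        err = sum(wj * xj for wj, xj in zip(w, X[i])) - noisy_y
        # Gradient of 0.5 * err**2 with respect to w is err * x_i.
        v = [beta * vj - err * xj for vj, xj in zip(v, X[i])]
        w = [wj + lr * vj for wj, vj in zip(w, v)]
    return w


if __name__ == "__main__":
    # Two data points, three parameters: a one-dimensional family of interpolators.
    X = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
    y = [1.0, 2.0]
    w = sgd_momentum_label_noise(X, y, lr=0.01, beta=0.9, noise_std=0.1, steps=5000)
    print("final weights:", w)
```

With `noise_std=0.0` the iterates settle on an interpolating solution; with nonzero label noise they keep fluctuating around the set of minimizers, which is the regime the abstract describes.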

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

- The paper is well written and well organized.
- The paper provides numerical experiments on synthetic and natural settings.
- Both rigorous proofs and heuristic arguments are given.
- The theoretical analysis of the timescales provides the simple prescription gamma=2/3, which is shown to speed up training and also to give the best generalization in some settings.

Weaknesses

- The work seems to be a relatively straightforward extension of Li et al. (2022).
- It is not clear whether the optimality of the gamma=2/3 exponent applies to standard SGD as well.
- The theory does not give a prescription for the prefactor C, which I think limits its usefulness in practice.

Reviewer 02 · Rating 3 · reject, not good enough · Confidence 3

Strengths

1. It is an important topic to study how the momentum hyperparameter is optimally picked in deep learning, as it allows for shrinking the search space of hyperparameters.
2. The theoretical results (at least the part I have understood) are solid.

Weaknesses

1. The biggest concern I have is regarding the presentation of this paper. The quantities $\tau_1$ and $\tau_2$ are the focus of this paper, but they are only defined through informal descriptions in the introduction (for example, "Define $\tau_2$ to be the number of time steps it takes so that the displacement of $w_k$ along $\Gamma$ becomes a finite number as we take $\epsilon \to 0$ first, and $\eta \to 0$ afterward"), with no definition, even an informal one, elsewhere. This makes these two terms extremely

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

I would like to thank the authors for the clear and sufficiently detailed introduction section that provides all the necessary results and ideas. The interpretation of the convergence as a trade-off between two main scales (longitudinal and traversal) looks interesting and promising for further development in the case of adaptive step size optimizers. The presented experimental results in Section 4 confirm the obtained theoretical dependence of the momentum hyperparameter on the learning rate. T

Weaknesses

The weaknesses of the presented study are listed below:
1) Section 3 is very hard to read for non-experts in stochastic process theory. I suggest the authors compress it and extend the section on experimental evaluation.
2) Figure 2 presents line fits that confirm the estimated dependence rule; however, I do not find an analysis of the variance of the derived estimate. I am not sure whether the dependence would change significantly if more points were added to the plots.

Taxonomy

Topics: Neural Networks and Applications · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques

Methods: Average Pooling · 1x1 Convolution · Batch Normalization · Global Average Pooling · Kaiming Initialization · Max Pooling · Residual Connection · Bottleneck Residual Block · Residual Block