The Marginal Value of Momentum for Small Learning Rate SGD

Runzhe Wang; Sadhika Malladi; Tianhao Wang; Kaifeng Lyu; Zhiyuan Li

arXiv:2307.15196 · cs.LG · April 17, 2024 · 1 cites

Open Access · 3 Reviews

TL;DR

This paper investigates the impact of momentum in stochastic gradient descent with small learning rates, revealing limited benefits for optimization and generalization in practical deep learning scenarios.

Contribution

Theoretical analysis clarifies that momentum offers minimal acceleration in stochastic settings with small learning rates, supported by empirical evidence.

Findings

01

Momentum has limited benefits in small learning rate regimes.

02

SGD with and without momentum behave similarly in stochastic settings.

03

Experiments on ImageNet and language models confirm limited advantages of momentum.
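One heuristic way to read finding 02 is to unroll a momentum update. The block below assumes the normalized (exponential moving average) form of SGDM with $m_0 = 0$; this parameterization is an assumption made for illustration, since the summary does not state the exact form the paper uses. Here $g_k$ denotes the stochastic gradient at step $k$, $\eta$ the learning rate, and $\beta$ the momentum coefficient.

```latex
\begin{align*}
m_k &= \beta\, m_{k-1} + (1-\beta)\, g_k, \qquad x_k = x_{k-1} - \eta\, m_k \\
\Longrightarrow\quad m_k &= (1-\beta) \sum_{j=1}^{k} \beta^{\,k-j} g_j \\
\text{total displacement caused by } g_j &= \eta\,(1-\beta) \sum_{i=0}^{\infty} \beta^{\,i}\, g_j = \eta\, g_j
\end{align*}
```

Under this assumed form, each gradient eventually moves the iterate by the same amount as in plain SGD, only spread over roughly $1/(1-\beta)$ steps. When $\eta$ is small, the iterate changes little over that window, which is consistent with the two methods behaving similarly.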

Abstract

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to…
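The claim that SGD with and without momentum behave similarly at small learning rates can be probed on a toy problem. The sketch below is not from the paper: it assumes a one-dimensional noisy quadratic loss, a normalized momentum update `m = beta * m + (1 - beta) * g` so both methods share the same learning rate `eta`, and identical noise draws for both runs. All function names and default values are hypothetical.

```python
import numpy as np


def noisy_grad(x, rng, curvature=1.0, noise_std=1.0):
    """Gradient of 0.5 * curvature * x**2 plus Gaussian noise (toy model, an assumption)."""
    return curvature * x + noise_std * rng.standard_normal()


def run_sgd(eta, steps, seed=0, x0=1.0):
    """Plain SGD on the toy loss; returns the iterate after each step."""
    rng = np.random.default_rng(seed)
    x = x0
    traj = np.empty(steps)
    for k in range(steps):
        x -= eta * noisy_grad(x, rng)
        traj[k] = x
    return traj


def run_sgdm(eta, beta, steps, seed=0, x0=1.0):
    """SGD with normalized momentum (assumed form), using the same noise stream as run_sgd."""
    rng = np.random.default_rng(seed)
    x, m = x0, 0.0
    traj = np.empty(steps)
    for k in range(steps):
        g = noisy_grad(x, rng)
        m = beta * m + (1.0 - beta) * g  # exponential moving average of gradients
        x -= eta * m
        traj[k] = x
    return traj


def max_trajectory_gap(eta, beta=0.9, steps=5000, seed=0):
    """Largest absolute difference between the SGD and SGDM iterates over the run."""
    sgd = run_sgd(eta, steps, seed)
    sgdm = run_sgdm(eta, beta, steps, seed)
    return float(np.max(np.abs(sgd - sgdm)))
```

Calling `max_trajectory_gap(eta=1e-3)` and then repeating with a larger `eta` gives a rough sense of how the gap between the two trajectories depends on the learning rate in this toy setting. It says nothing about the neural network experiments described in the abstract.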

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

1. The paper has demonstrated the limited benefit of momentum theoretically in some regimes, which helps suggest when not to use momentum. 2. The article has illustrated the idea with warmup examples, which makes it easier to digest.

Weaknesses

1. The analysis rests on the observation that the effect of momentum is dominated by other factors when the learning rate is negligible. This is somewhat intuitive, and the paper needs to dive deeper into what the special role of momentum is. 2. The paper needs clear examples of the scenarios the theory covers. For example, which classes of loss functions and neural networks does the theory apply to? 3. The paper seems rushed in polishing. There are some links of reference in the paper that need to be added, for exa

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

1. The main results, i.e., Theorem 3.5 and Theorem 4.5, are strong in terms of both implications and proof techniques. 2. The results are presented in an orderly manner. 3. The result that the training trajectories of SGDM and SGD are similar is intriguing.

Weaknesses

1. Although the implications of Theorem 3.5 are clear, the statement of the theorem is a bit confusing. Specifically, the averaged learning schedule $\bar\eta_k$ is introduced, and it does not appear again until the appendix. 2. It would be interesting to see other concrete types of hyperparameter schedules apart from the constant ones, both theoretically and empirically. 3. It would also be interesting to see experiments demonstrating the main theoretical contribution of this paper, i.e., the tra

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

Prior research has noted that the advantage of incorporating momentum in the training of neural networks is somewhat limited, and there hasn't been a definitive theoretical outcome regarding the effectiveness of momentum in the stochastic setting. Therefore, demonstrating that momentum does not significantly reduce variance or improve generalization can lead to savings in computational resources and memory usage during deep neural network training. A noteworthy technical innovation in this work

Weaknesses

The assumptions under which the paper shows that SGDM approximates SGD are too restrictive. It would be great if the authors could justify these assumptions.

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM

Methods: Stochastic Gradient Descent