The Marginal Value of Momentum for Small Learning Rate SGD
Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

TL;DR
This paper investigates the impact of momentum in stochastic gradient descent with small learning rates, revealing limited benefits for optimization and generalization in practical deep learning scenarios.
Contribution
Theoretical analysis clarifies that momentum offers minimal acceleration in stochastic settings with small learning rates, supported by empirical evidence.
Findings
Momentum has limited benefits in small learning rate regimes.
SGD with and without momentum behave similarly in stochastic settings.
Experiments on ImageNet and language models confirm limited advantages of momentum.
Abstract
Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to…
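The claim that SGD with and without momentum behave similarly at small learning rates can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy problem (a noisy diagonal quadratic), the function names `run_sgdm` and `max_gap`, the hyperparameter values, and the normalized momentum update m = beta * m + (1 - beta) * g, x = x - eta * m, chosen so that both methods share the same learning rate eta. Both runs see the same noise sequence, and the horizon is fixed in units of eta * steps.

```python
import numpy as np


def run_sgdm(eta, beta, steps, seed=0, dim=5, noise_std=1.0):
    """Normalized SGDM on a noisy diagonal quadratic; beta=0 gives plain SGD."""
    rng = np.random.default_rng(seed)
    curvature = np.linspace(0.5, 2.0, dim)  # hypothetical Hessian eigenvalues
    x = np.ones(dim)
    m = np.zeros(dim)
    traj = np.empty((steps, dim))
    for k in range(steps):
        # Stochastic gradient: exact gradient plus Gaussian noise.
        grad = curvature * x + noise_std * rng.standard_normal(dim)
        # Assumed normalized momentum update (not confirmed by the source).
        m = beta * m + (1.0 - beta) * grad
        x = x - eta * m
        traj[k] = x
    return traj


def max_gap(eta, beta=0.9, horizon=5.0, seed=0):
    """Largest distance between the SGD and SGDM iterates over the run."""
    steps = int(round(horizon / eta))
    sgd = run_sgdm(eta, 0.0, steps, seed)    # same seed, so same noise
    sgdm = run_sgdm(eta, beta, steps, seed)
    return float(np.max(np.linalg.norm(sgd - sgdm, axis=1)))


if __name__ == "__main__":
    for eta in (1e-1, 1e-2, 1e-3):
        print(f"eta={eta:g}  max gap={max_gap(eta):.4f}")
```

In this toy setting the gap between the two trajectories shrinks as eta decreases, which is consistent with the abstract's statement. It is only a sanity check on a quadratic and does not reproduce the paper's theorems or experiments.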
Peer Reviews
Decision: ICLR 2024 poster
1. The paper has demonstrated the limited benefit of momentum theoretically in some regimes, which helps suggest when not to use momentum. 2. The article has illustrated the idea with warmup examples, which makes it easier to digest.
1. The analysis is based on the fact that the momentum effect is dominated by other factors when the learning rate is negligible. This is somewhat intuitive, and the paper needs to dive deeper into what the special role of momentum is. 2. There need to be clear examples of the scenarios the theory fits. For example, which classes of loss and neural network does the theory cover? 3. The paper seems rushed in polishing. There are some reference links in the paper that need to be added, for exa
1. The main results, i.e., Theorem 3.5 and Theorem 4.5, are strong in terms of both implications and proof techniques. 2. The results are presented in an orderly manner. 3. The result that the training trajectories of SGDM and SGD are similar is intriguing.
1. Although the implications of Theorem 3.5 are clear, the statement of the theorem is a bit confusing. Specifically, the averaged learning schedule $\bar\eta_k$ is introduced, and it does not appear again until the appendix. 2. It would be interesting to see other concrete types of hyperparameter schedule apart from the constant ones, both theoretically and empirically. 3. It would also be interesting to see experiments demonstrating the main theoretical contribution of this paper, i.e., the tra
Prior research has noted that the advantage of incorporating momentum in the training of neural networks is somewhat limited, and there hasn't been a definitive theoretical outcome regarding the effectiveness of momentum in the stochastic setting. Therefore, demonstrating that momentum does not significantly reduce variance or improve generalization can lead to savings in computational resources and memory usage during deep neural network training. A noteworthy technical innovation in this work
The assumptions under which the paper shows that SGDM approximates SGD are too restrictive. It would be great if the authors could justify these restrictive assumptions.
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Advanced Neural Network Applications · Machine Learning and ELM
Methods: Stochastic Gradient Descent
