Rethinking Adam: A Twofold Exponential Moving Average Approach
Yizhou Wang, Yue Kang, Can Qin, Huan Wang, Yi Xu, Yulun Zhang, Yun Fu

TL;DR
This paper introduces AdaMomentum, a new optimizer that improves training speed and generalization in deep learning by replacing the raw gradient with its momentumized version in the second moment estimate, backed by theory and extensive experiments.
Contribution
The paper proposes AdaMomentum, a novel optimizer that enhances Adam by using momentumized gradients for better generalization and convergence guarantees.
Findings
AdaMomentum outperforms existing optimizers in various tasks.
It achieves faster training with improved generalization.
Theoretical analysis supports its convergence in convex and nonconvex settings.
Abstract
Adaptive gradient methods, e.g. Adam, have achieved tremendous success in machine learning. By scaling the learning rate element-wise with a certain form of second moment estimate of the gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to suffer from compromised generalization ability compared with stochastic gradient descent (SGD) and tend to be trapped in local minima at an early stage of training. Intriguingly, we discover that substituting the gradient in the second raw moment estimate term with its momentumized version in Adam can resolve the issue. The intuition is that the gradient with momentum contains more accurate directional information, and therefore its second moment estimate is a more favorable option for learning rate scaling than that of the raw gradient. Thereby we propose…
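The substitution described in the abstract can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `adam_like_step`, the `momentumized` flag, the default hyperparameters, the bias correction, and the placement of `eps` are all assumptions carried over from the standard Adam update, since the abstract does not specify them.

```python
import math

def adam_like_step(params, grads, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, momentumized=True):
    """One element-wise update on flat lists of floats (hypothetical helper).

    momentumized=False: standard Adam, the second moment tracks g_t ** 2.
    momentumized=True:  the second moment tracks m_t ** 2, i.e. the raw
                        gradient is replaced by its momentumized version.
    t is the 1-based step count used for bias correction (assumed as in Adam).
    """
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        # First moment: exponential moving average of the gradient.
        m_i = beta1 * m_i + (1 - beta1) * g
        # Second moment: EMA of the squared momentum or the squared gradient.
        source = m_i if momentumized else g
        v_i = beta2 * v_i + (1 - beta2) * source * source
        # Bias correction, assumed identical to standard Adam.
        m_hat = m_i / (1 - beta1 ** t)
        v_hat = v_i / (1 - beta2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v

# Usage: a few steps on f(x) = x ** 2, whose gradient is 2 * x.
x, m, v = [1.0], [0.0], [0.0]
for step in range(1, 6):
    grads = [2 * xi for xi in x]
    x, m, v = adam_like_step(x, grads, m, v, t=step)
```

The only difference between the two branches is the quantity that feeds the second moment estimate. Everything else follows the familiar Adam recipe, so the sketch isolates the one change the abstract describes.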
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Domain Adaptation and Few-Shot Learning
Methods: Adam
