Rethinking Adam: A Twofold Exponential Moving Average Approach

Yizhou Wang; Yue Kang; Can Qin; Huan Wang; Yi Xu; Yulun Zhang; Yun Fu

arXiv:2106.11514 · cs.LG · February 10, 2022 · 1 citation

TL;DR

This paper introduces AdaMomentum, a new optimizer that improves training speed and generalization in deep learning by replacing the raw gradient with its momentumized version in the second moment estimate, backed by theory and extensive experiments.

Contribution

The paper proposes AdaMomentum, an optimizer that modifies Adam by using the momentumized gradient, rather than the raw gradient, in the second moment estimate. The change targets better generalization and is accompanied by convergence guarantees.

Findings

1. AdaMomentum outperforms existing optimizers in various tasks.
2. It achieves faster training with improved generalization.
3. Theoretical analysis supports its convergence in convex and nonconvex settings.

Abstract

Adaptive gradient methods, e.g. Adam, have achieved tremendous success in machine learning. By scaling the learning rate element-wise with a certain form of second moment estimate of the gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to suffer from compromised generalization ability compared with stochastic gradient descent (SGD) and tend to be trapped in local minima at an early stage of training. Intriguingly, we discover that substituting the gradient in the second raw moment estimate term of Adam with its momentumized version can resolve the issue. The intuition is that the gradient with momentum contains more accurate directional information, and therefore its second moment estimate is a more favorable choice for learning rate scaling than that of the raw gradient. Thereby we propose…
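The substitution described in the abstract can be illustrated with a minimal scalar sketch. It is not the authors' implementation: the function name `adam_like_step`, the hyperparameter defaults, the Adam-style bias correction, and the placement of `eps` in the denominator are all assumptions made here for illustration. The only element taken from the abstract is that the second moment estimate is built from the momentumized gradient instead of the raw gradient.

```python
import math


def adam_like_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999,
                   eps=1e-8, momentumized=True):
    """One scalar update step (hypothetical helper, for illustration only).

    momentumized=True  -> second moment tracks the square of the momentum m_t
    momentumized=False -> second moment tracks the square of the raw gradient
                          (standard Adam)
    """
    # First moment: exponential moving average (EMA) of the gradient.
    m = beta1 * m + (1.0 - beta1) * grad

    # Second moment: EMA of either m_t^2 or g_t^2.
    signal = m if momentumized else grad
    v = beta2 * v + (1.0 - beta2) * signal * signal

    # Assumption: Adam-style bias correction for both moments.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Assumption: eps added to the denominator, as in common Adam code.
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# First step on f(theta) = theta^2, starting from theta = 5 (gradient = 10).
adam_theta, _, _ = adam_like_step(5.0, 10.0, 0.0, 0.0, t=1, momentumized=False)
mom_theta, _, _ = adam_like_step(5.0, 10.0, 0.0, 0.0, t=1, momentumized=True)
print(adam_theta, mom_theta)
```

Under these assumptions the two variants already differ on the first step. The momentum m_1 equals (1 - beta1) times the gradient, so its square is smaller than the squared raw gradient and the effective step is larger (about 1.0 here, versus about 0.1 for the Adam variant). The repository listed under Code & Models is the place to check how the authors actually handle bias correction and `eps`.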

Code & Models

Repositories

wyzjack/AdaM3 (PyTorch, official)

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Machine Learning and ELM · Domain Adaptation and Few-Shot Learning

Methods: Adam