Adam revisited: a weighted past gradients perspective
Hui Zhong, Zaiyi Chen, Chuan Qin, Zai Huang, Vincent W. Zheng, Tong Xu, Enhong Chen

TL;DR
This paper introduces WADA, a new adaptive gradient method with a milder weighting strategy that improves convergence and regret bounds, explaining ADAM's practical success in training neural networks.
Contribution
The paper proposes WADA, a novel adaptive gradient algorithm with linear weighting on past gradients, and provides theoretical regret bounds and empirical validation.
Findings
WADA achieves better weighted regret bounds than ADAGRAD when gradients decrease rapidly.
Extensive experiments show WADA outperforms ADAM variants in training neural networks.
The milder weighting strategy improves convergence and practical performance.
Abstract
Adaptive learning rate methods have been successfully applied in many fields, especially in training deep neural networks. Recent results have shown that adaptive methods with exponentially increasing weights on squared past gradients (i.e., ADAM, RMSPROP) may fail to converge to the optimal solution. Though many algorithms, such as AMSGRAD and ADAMNC, have been proposed to fix the non-convergence issues, achieving a data-dependent regret bound similar to or better than that of ADAGRAD is still a challenge for these methods. In this paper, we propose a novel adaptive method, the weighted adaptive algorithm (WADA), to tackle the non-convergence issues. Unlike AMSGRAD and ADAMNC, we consider a milder weighting strategy on squared past gradients, in which the weights grow linearly. Based on this idea, we propose the weighted adaptive gradient method framework (WAGMF) and implement the WADA algorithm on…
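The idea of linearly growing weights on squared past gradients can be illustrated with a short Python sketch. This is not the paper's WADA update, which the truncated abstract does not spell out. The function names, the normalization by the sum of the weights, the 1/sqrt(t) step-size schedule, and the toy objective are all assumptions made for illustration only.

```python
import math


def linear_weighted_second_moment(grads):
    """Running weighted average of squared gradients with weights 1, 2, ..., t.

    Assumption: the estimate is normalized by the sum of the weights so far.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    history = []
    for t, g in enumerate(grads, start=1):
        weighted_sum += t * g * g   # weight on the t-th squared gradient grows linearly
        weight_total += t
        history.append(weighted_sum / weight_total)
    return history


def minimize_scalar(grad_fn, x0, steps=200, lr=0.5, eps=1e-8):
    """Hypothetical 1-D descent loop that scales each step by the estimate above."""
    x = x0
    weighted_sum = 0.0
    weight_total = 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        weighted_sum += t * g * g
        weight_total += t
        v = weighted_sum / weight_total
        # Assumed step-size schedule lr / sqrt(t); not taken from the paper.
        x -= (lr / math.sqrt(t)) * g / (math.sqrt(v) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_final = minimize_scalar(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

For the gradient sequence 1.0, 2.0 the estimate is 1.0 after the first step and (1·1 + 2·4) / (1 + 2) = 3.0 after the second, so the more recent gradient counts twice as much as the first. Under an exponential scheme the relative weight of recent gradients grows geometrically instead, which is the contrast the abstract draws.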
Taxonomy
Methods: Adam · AdaGrad · AMSGrad
