ADOPT: Modified Adam Can Converge with Any $\beta_2$ with the Optimal Rate
Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo

TL;DR
ADOPT is a new adaptive gradient method that guarantees convergence with any choice of $\beta_2$, achieving the optimal rate without the impractical bounded noise assumption, and outperforms Adam in various deep learning tasks.
Contribution
ADOPT introduces a novel modification to Adam that ensures convergence with any $\beta_2$, removing the need for problem-dependent hyperparameter tuning and for the bounded noise assumption.
Findings
ADOPT converges at the optimal rate $\mathcal{O}(1/\sqrt{T})$ in theory.
ADOPT outperforms Adam and variants across multiple tasks.
ADOPT is robust to any choice of $\beta_2$, simplifying hyperparameter tuning.
Abstract
Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $\beta_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O}(1/\sqrt{T})$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT…
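The two changes named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' reference implementation: the function name `adopt_sketch`, the default hyperparameters, the `max(sqrt(v), eps)` clipping, the choice to seed the second moment with the first squared gradient, and the toy quadratic objective are all assumptions made here for illustration.

```python
import numpy as np


def adopt_sketch(grad_fn, theta0, lr=0.1, beta1=0.9, beta2=0.999,
                 eps=1e-6, steps=2000):
    """Illustrative ADOPT-style loop (hypothetical helper, not the official code)."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    # Assumption: seed the second-moment estimate with the first squared gradient.
    v = grad_fn(theta) ** 2
    for _ in range(steps):
        g = grad_fn(theta)
        # Change 1: normalize by the PREVIOUS second-moment estimate,
        # so the current gradient is not part of its own normalizer.
        normalized = g / np.maximum(np.sqrt(v), eps)
        # Change 2: momentum is accumulated AFTER normalization
        # (Adam normalizes the momentum instead).
        m = beta1 * m + (1.0 - beta1) * normalized
        theta = theta - lr * m
        # The current gradient enters the second moment only now,
        # for use at the next step.
        v = beta2 * v + (1.0 - beta2) * g ** 2
    return theta


# Toy problem (assumption): minimize 0.5 * ||theta - target||^2.
target = np.array([1.0, 2.0])
result = adopt_sketch(lambda th: th - target, theta0=[5.0, -3.0])
print(result)
```

On this deterministic quadratic the iterate should end up close to `target`; the sketch says nothing about the stochastic setting that the convergence result addresses.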
Taxonomy
Topics: Computability, Logic, AI Algorithms · Constraint Satisfaction and Optimization
Methods: ADaptive gradient method with the OPTimal convergence rate · Adam
