Adam Can Converge Without Any Modification On Update Rules
Yushun Zhang, Congliang Chen, Naichen Shi, Ruoyu Sun, Zhi-Quan Luo

TL;DR
This paper proves that the Adam optimizer can converge to (a neighborhood of) critical points without any modification to its update rules, provided the hyperparameters are tuned after the problem is fixed rather than before. This helps explain the gap between theory and practice.
Contribution
It provides the first theoretical convergence proof for Adam with standard update rules under realistic hyperparameter settings.
Findings
Adam converges to a neighborhood of critical points when β₂ is large and β₁ < √β₂ < 1.
Under the strong growth condition, Adam converges to critical points.
There is a phase transition from divergence to convergence as β₂ increases.
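As a rough illustration of the findings above, the sketch below implements the standard textbook Adam update (with bias correction, on a scalar parameter) together with a check of the β₁ < √β₂ < 1 condition. The function names, the quadratic test problem, and the step size are assumptions made for illustration only; the paper's exact setting (for example its step-size schedule and stochastic sampling) is not reproduced here.

```python
import math


def satisfies_beta_condition(beta1, beta2):
    """Return True if beta1 < sqrt(beta2) < 1 (the condition quoted in the findings)."""
    return 0.0 <= beta1 < math.sqrt(beta2) < 1.0


def adam_step(x, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One unmodified Adam step on a scalar parameter x; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (math.sqrt(v_hat) + eps)
    return x, m, v


def minimize_quadratic(steps=2000, lr=0.1, beta1=0.9, beta2=0.999):
    """Run Adam on the toy objective f(x) = x**2 (hypothetical example), starting at x = 5."""
    x, m, v = 5.0, 0.0, 0.0
    for t in range(1, steps + 1):
        grad = 2.0 * x  # exact gradient of x**2
        x, m, v = adam_step(x, grad, m, v, t, lr=lr, beta1=beta1, beta2=beta2)
    return x


if __name__ == "__main__":
    print("condition holds for (0.9, 0.999):", satisfies_beta_condition(0.9, 0.999))
    print("condition holds for (0.9, 0.5):  ", satisfies_beta_condition(0.9, 0.5))
    print("final x on f(x) = x^2:", minimize_quadratic())
```

The common defaults (β₁, β₂) = (0.9, 0.999) satisfy the inequality, since √0.999 ≈ 0.9995 > 0.9. The check is only a necessary screen: how large β₂ has to be depends on the problem, which a simple inequality test cannot capture.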
Abstract
Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., (β₁, β₂); while practical applications often fix the problem first and then tune (β₁, β₂). Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when β₂ is large and β₁ < √β₂ < 1, Adam converges to the neighborhood of critical points. The size of the…
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Statistical Mechanics and Entropy
Methods: Adam
