Adam Can Converge Without Any Modification On Update Rules

Yushun Zhang; Congliang Chen; Naichen Shi; Ruoyu Sun; Zhi-Quan Luo

arXiv:2208.09632 · cs.LG · January 16, 2023 · 6 citations


TL;DR

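This paper proves that the Adam optimizer can converge without any modification to its update rules, to a neighborhood of critical points in general and to critical points under the strong growth condition, provided the hyperparameters (β₁, β₂) are tuned after the problem is fixed. This ordering explains the gap between earlier divergence theory and practice.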

Contribution

It provides the first theoretical convergence proof for Adam with standard update rules under realistic hyperparameter settings.

Findings

01

Adam converges to a neighborhood of critical points when β₂ is large and β₁ < √β₂ < 1.

02

Under the strong growth condition, Adam converges to critical points.

03

There is a phase transition from divergence to convergence as β₂ increases.
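The findings above refer to the unmodified Adam update and to the ordering β₁ < √β₂ < 1. The following is a minimal scalar sketch of both, assuming the standard Adam recursion without bias correction and a constant step size; the function names, the `eps` default, and the quadratic test problem are illustrative assumptions, not the paper's experimental setup.

```python
import math


def betas_satisfy_condition(beta1, beta2):
    """Return True when 0 <= beta1 < sqrt(beta2) < 1 (the ordering in finding 01)."""
    return 0.0 <= beta1 < math.sqrt(beta2) < 1.0


def adam_step(x, m, v, grad, lr, beta1, beta2, eps=1e-8):
    """One vanilla Adam update on a scalar parameter (no bias correction, by assumption)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    x = x - lr * m / (math.sqrt(v) + eps)         # step scaled by sqrt of second moment
    return x, m, v


def run_adam(grad_fn, x0, steps, lr=0.01, beta1=0.9, beta2=0.999):
    """Run `steps` Adam updates from x0 using the gradient oracle grad_fn."""
    x, m, v = x0, 0.0, 0.0
    for _ in range(steps):
        x, m, v = adam_step(x, m, v, grad_fn(x), lr, beta1, beta2)
    return x


if __name__ == "__main__":
    # Toy problem (an assumption for illustration): f(x) = x^2, gradient 2x.
    print(betas_satisfy_condition(0.9, 0.999))
    print(run_adam(lambda x: 2.0 * x, x0=1.0, steps=2000))
```

The check covers only the ordering of β₁ and β₂; how large β₂ must be for a given problem is problem-dependent and is not captured by this sketch.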

Abstract

Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., $(β_{1}, β_{2})$; while practical applications often fix the problem first and then tune $(β_{1}, β_{2})$. Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when $β_{2}$ is large and $β_{1} < \sqrt{β_{2}} < 1$, Adam converges to the neighborhood of critical points. The size of the…
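The change in the order of picking the problem and the hyperparameters can be read as a swap of quantifiers. The following is a schematic paraphrase only: the symbol $f$ for the problem and the notation $\mathrm{Adam}(β_{1}, β_{2})$ are assumptions introduced here, and the precise problem class and conditions are those stated in the paper.

```latex
% Hyperparameters first, then the problem (setting of Reddi et al. 2018):
\forall\, (\beta_1, \beta_2)\ \ \exists\, f :\quad
  \mathrm{Adam}(\beta_1, \beta_2)\ \text{diverges on}\ f .

% Problem first, then the hyperparameters (setting of this work):
\forall\, f\ \ \exists\, (\beta_1, \beta_2)\ \text{with}\ \beta_1 < \sqrt{\beta_2} < 1 :\quad
  \mathrm{Adam}(\beta_1, \beta_2)\ \text{converges to a neighborhood of critical points of}\ f .
```

The two statements do not contradict each other, because the hyperparameters in the second statement are allowed to depend on the problem.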


Videos

Adam Can Converge Without Any Modification On Update Rules · SlidesLive

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Markov Chains and Monte Carlo Methods · Statistical Mechanics and Entropy

Methods: Adam