Adam Converges Without Any Modification On Update Rules
Yushun Zhang, Bingran Li, Congliang Chen, Zhi-Quan Luo, Ruoyu Sun

TL;DR
This paper provides a rigorous theoretical analysis of Adam optimizer convergence, revealing a phase transition dependent on hyperparameters and batch size, with practical tuning suggestions supported by empirical evidence.
Contribution
It establishes convergence conditions for Adam based on hyperparameters, identifies a phase transition in their values, and offers practical tuning guidance for training neural networks.
Findings
Adam converges when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}$
A phase transition from divergence to convergence exists in the $(\beta_1, \beta_2)$ plane
Tuning $\beta_2$ inversely with batch size improves training performance
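As a rough illustration of the first finding, the sketch below checks the inequality $\beta_1 < \sqrt{\beta_2}$ for a candidate hyperparameter pair. The function name `satisfies_beta_condition` is hypothetical and not taken from the paper. It tests only the inequality; how large $\beta_2$ must be is problem-dependent and is not captured here.

```python
import math


def satisfies_beta_condition(beta1: float, beta2: float) -> bool:
    """Return True when beta1 < sqrt(beta2).

    Hypothetical helper for illustration only. It does not decide whether
    beta2 is "large enough", since that threshold depends on the problem.
    """
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError("beta1 and beta2 must lie in [0, 1)")
    return beta1 < math.sqrt(beta2)


# Common defaults (0.9, 0.999) satisfy the inequality: sqrt(0.999) > 0.9.
default_ok = satisfies_beta_condition(0.9, 0.999)

# A small beta2 can violate it: sqrt(0.5) is about 0.707, which is below 0.9.
small_beta2_ok = satisfies_beta_condition(0.9, 0.5)
```

A check like this could run before a hyperparameter sweep to skip pairs that fall outside the inequality.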
Abstract
Adam is the default algorithm for training neural networks, including large language models (LLMs). However, \citet{reddi2019convergence} provided an example in which Adam diverges, raising concerns for its deployment in AI model training. We identify a key mismatch between the divergence example and practice: \citet{reddi2019convergence} pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$; while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}$. Second, when $\beta_2$ is small, we point out a region of $(\beta_1, \beta_2)$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to…
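For readers who want to see where $\beta_1$ and $\beta_2$ enter the update, here is a minimal sketch of the commonly used Adam step with bias correction, applied to a one-dimensional toy quadratic. The function `adam_minimize`, the toy objective, and all default values are assumptions for illustration. This is not the paper's experimental setup, and a deterministic toy problem says nothing about the stochastic divergence examples the abstract refers to.

```python
import math


def adam_minimize(grad, x0, lr=0.01, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=2000):
    """Run textbook Adam on a scalar parameter (illustrative sketch only)."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        # beta1 controls the moving average of gradients (first moment).
        m = beta1 * m + (1.0 - beta1) * g
        # beta2 controls the moving average of squared gradients (second moment).
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction compensates for the zero initialization of m and v.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective (an assumption): f(x) = (x - 3)^2, with gradient 2 * (x - 3).
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
```

Varying `beta1` and `beta2` in this sketch shows how the two moving averages shape the step size. Reproducing the divergence region the abstract describes would need a stochastic problem chosen for that purpose.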
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications
