Adam Converges Without Any Modification On Update Rules

Yushun Zhang; Bingran Li; Congliang Chen; Zhi-Quan Luo; Ruoyu Sun

arXiv:2603.02092 · cs.LG · March 3, 2026

TL;DR

This paper provides a rigorous theoretical analysis of Adam optimizer convergence, revealing a phase transition dependent on hyperparameters and batch size, with practical tuning suggestions supported by empirical evidence.

Contribution

The paper establishes conditions on the hyperparameters $(β_{1}, β_{2})$ under which Adam converges, identifies a phase transition from divergence to convergence as these values change, and offers practical tuning guidance for training neural networks.

Findings

01

Adam converges when $β_{2}$ is large and $β_{1} < \sqrt{β_{2}}$

02

A phase transition from divergence to convergence exists in the $(β_{1}, β_{2})$ plane

03

Tuning inversely with batch size improves training performance
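
For reference, the hyperparameters $β_{1}$ and $β_{2}$ in these findings are the two averaging coefficients of Adam. A sketch of the textbook update is given here, without bias correction. The symbols $g_{k}$ (stochastic gradient), $m_{k}$, $v_{k}$, $η_{k}$ (step size), and $ε$ are notation assumed for this note and may differ from the paper's.

$$
m_{k} = β_{1} m_{k-1} + (1 - β_{1})\, g_{k}, \qquad
v_{k} = β_{2} v_{k-1} + (1 - β_{2})\, g_{k} \circ g_{k}, \qquad
x_{k+1} = x_{k} - \frac{η_{k}}{\sqrt{v_{k}} + ε} \circ m_{k}
$$

Here $\circ$ and the square root act elementwise, so $β_{1}$ sets how quickly the averaged gradient $m_{k}$ forgets old gradients and $β_{2}$ does the same for the averaged squared gradient $v_{k}$.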

Abstract

Adam is the default algorithm for training neural networks, including large language models (LLMs). However, Reddi et al. (2019) provided an example in which Adam diverges, raising concerns about its deployment in AI model training. We identify a key mismatch between the divergence example and practice: Reddi et al. (2019) pick the problem after picking the hyperparameters of Adam, i.e., $(β_{1}, β_{2})$, while practical applications often fix the problem first and then tune $(β_{1}, β_{2})$. In this work, we prove that Adam converges with proper problem-dependent hyperparameters. First, we prove that Adam converges when $β_{2}$ is large and $β_{1} < \sqrt{β_{2}}$. Second, when $β_{2}$ is small, we point out a region of $(β_{1}, β_{2})$ combinations where Adam can diverge to infinity. Our results indicate a phase transition for Adam from divergence to…
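
To make the roles of $β_{1}$ and $β_{2}$ concrete, here is a minimal Python sketch of textbook Adam on a scalar toy problem, with a helper that checks only the ordering $β_{1} < \sqrt{β_{2}}$. The function names `satisfies_beta_ordering` and `adam_minimize`, the quadratic objective, the step size, and the use of bias correction are all assumptions made for illustration and are not taken from the paper. The problem-dependent threshold on how large $β_{2}$ must be is not modeled.

```python
import math


def satisfies_beta_ordering(beta1: float, beta2: float) -> bool:
    """Return True when 0 < beta2 < 1 and 0 <= beta1 < sqrt(beta2).

    Hypothetical helper: it checks only the ordering between beta1 and beta2.
    How large beta2 must be is problem-dependent and is not modeled here.
    """
    return 0.0 < beta2 < 1.0 and 0.0 <= beta1 < math.sqrt(beta2)


def adam_minimize(grad_fn, x0, steps, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run textbook Adam (with bias correction) on a scalar parameter."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1.0 - beta1) * g      # first-moment (momentum) estimate
        v = beta2 * v + (1.0 - beta2) * g * g  # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)         # bias correction for m
        v_hat = v / (1.0 - beta2 ** t)         # bias correction for v
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3); illustration only.
x_final = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0, steps=2000)
print(satisfies_beta_ordering(0.9, 0.999), x_final)
```

The pair (0.6, 0.5) passes the ordering check even though $β_{1} > β_{2}$, because $\sqrt{0.5} \approx 0.707$. The pair (0.9, 0.5) fails it.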

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Neural Networks and Applications