Adam symmetry theorem: characterization of the convergence of the stochastic Adam optimizer

Steffen Dereich; Thang Do; Arnulf Jentzen; Philippe von Wurstemberger

arXiv:2511.06675·math.OC·November 11, 2025

Open Access

TL;DR

This paper rigorously analyzes the convergence of the Adam optimizer for strongly convex problems, establishing rates and conditions under which Adam converges, and introduces the Adam symmetry theorem highlighting the importance of data distribution symmetry.

Contribution

It provides the first rigorous convergence rates for Adam on strongly convex problems and introduces the Adam symmetry theorem showing convergence depends on data symmetry.

Findings

01 Convergence rate 1/2 with respect to the size of the learning rate

02 Convergence rate 1 with respect to the size of the mini-batches

03 Convergence rate 1 with respect to the distance of the second moment parameter to 1

Abstract

Besides the standard stochastic gradient descent (SGD) method, the Adam optimizer due to Kingma & Ba (2014) is currently probably the best-known optimization method for the training of deep neural networks in artificial intelligence (AI) systems. Despite the popularity and the success of Adam, it remains an open research problem to provide a rigorous convergence analysis for Adam, even for the class of strongly convex stochastic optimization problems (SOPs). In one of the main results of this work we establish convergence rates for Adam for the class of strongly convex SOPs in terms of the number of gradient steps (convergence rate 1/2 w.r.t. the size of the learning rate), the size of the mini-batches (convergence rate 1 w.r.t. the size of the mini-batches), and the size of the second moment parameter of Adam (convergence rate 1 w.r.t. the distance of the second moment parameter to 1). In a further main…
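To make the three quantities named in the abstract concrete (learning rate, mini-batch size, and second moment parameter), here is a minimal sketch of the standard Adam update from Kingma & Ba applied to a one-dimensional strongly convex SOP. The objective L(theta) = E[(theta - X)^2] / 2, the sample list, the function name `adam_quadratic`, and all hyperparameter values are illustrative assumptions and are not taken from the paper.

```python
import math
import random


def adam_quadratic(samples, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8,
                   batch_size=32, steps=5000, theta0=0.0, seed=0):
    """Run Adam on the assumed objective L(theta) = E[(theta - X)^2] / 2.

    X is drawn uniformly from `samples`, so the unique minimizer is the
    mean of `samples`. This is an illustrative toy problem, not the
    paper's setting.
    """
    rng = random.Random(seed)
    theta, m, v = theta0, 0.0, 0.0
    for t in range(1, steps + 1):
        # Mini-batch gradient of (theta - X)^2 / 2 is theta - mean(batch).
        batch = [rng.choice(samples) for _ in range(batch_size)]
        grad = theta - sum(batch) / batch_size
        # Exponential moving averages of the gradient and its square.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad * grad
        # Bias correction, as in the original Adam formulation.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


if __name__ == "__main__":
    # Symmetric toy data around 2.0 (an assumption for illustration).
    print(adam_quadratic([1.0, 3.0]))
```

Varying `lr`, `batch_size`, and `1 - beta2` in this sketch is one way to explore the dependencies the abstract describes. Passing a skewed sample list instead of a symmetric one is a way to experiment with the symmetry condition mentioned in the TL;DR; no particular outcome is asserted here.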

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Privacy-Preserving Technologies in Data · Neural Networks and Applications