Simple Convergence Proof of Adam From a Sign-like Descent Perspective

Hanyang Peng; Shuang Qin; Yue Yu; Fangqing Jiang; Hui Wang; Zhouchen Lin

arXiv:2507.05966·cs.LG·July 10, 2025

Open Access · 3 Reviews

TL;DR

This paper offers a simplified convergence proof for Adam by interpreting it as a sign-like optimizer, achieving optimal convergence rates under mild conditions and providing practical tuning insights.

Contribution

It introduces a novel sign-like perspective of Adam, simplifying the convergence analysis and establishing optimal rates without strong assumptions or dependence on model size.

Findings

01

Proves Adam converges at rate O(1/T^{1/4}) under mild conditions.

02

Provides practical guidelines for learning rate tuning.

03

Eliminates dependence on model dimensionality and epsilon.

Abstract

Adam is widely recognized as one of the most effective optimizers for training deep neural networks (DNNs). Despite its remarkable empirical success, its theoretical convergence analysis remains unsatisfactory. Existing works predominantly interpret Adam as a preconditioned stochastic gradient descent with momentum (SGDM), formulated as $x_{t+1} = x_t - \frac{\gamma_t}{\sqrt{v_t} + \epsilon} \circ m_t$. This perspective necessitates strong assumptions and intricate techniques, resulting in lengthy and opaque convergence proofs that are difficult to verify and extend. In contrast, we propose a novel interpretation by treating Adam as a sign-like optimizer, expressed as $x_{t+1} = x_t - \gamma_t \frac{|m_t|}{\sqrt{v_t} + \epsilon} \circ \mathrm{Sign}(m_t)$. This reformulation significantly simplifies the convergence analysis. For the first time,…
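The two formulations in the abstract describe the same update, because $m_t = |m_t| \circ \mathrm{Sign}(m_t)$ holds elementwise. A minimal NumPy sketch of that equivalence follows. It assumes the standard Adam update with $\sqrt{v_t}$ in the denominator and omits bias correction; the function names, hyperparameter values, and random gradients are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def adam_moments(grad, m, v, beta1=0.9, beta2=0.999):
    """Update the first and second moment estimates (no bias correction)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    return m, v


def step_preconditioned(x, m, v, lr, eps):
    """Preconditioned-SGDM view: x - lr / (sqrt(v) + eps) * m."""
    return x - lr / (np.sqrt(v) + eps) * m


def step_sign_like(x, m, v, lr, eps):
    """Sign-like view: x - lr * (|m| / (sqrt(v) + eps)) * sign(m)."""
    scale = np.abs(m) / (np.sqrt(v) + eps)  # elementwise, non-negative
    return x - lr * scale * np.sign(m)


# Hypothetical setup: random gradients stand in for a real objective.
rng = np.random.default_rng(0)
x = rng.normal(size=5)
m = np.zeros(5)
v = np.zeros(5)
for _ in range(3):
    grad = rng.normal(size=5)
    m, v = adam_moments(grad, m, v)

x_precond = step_preconditioned(x, m, v, lr=1e-3, eps=1e-8)
x_sign = step_sign_like(x, m, v, lr=1e-3, eps=1e-8)
same_update = bool(np.allclose(x_precond, x_sign))
```

In this sketch the difference between the two views is purely one of grouping: the sign-like form isolates a non-negative per-coordinate scale `|m| / (sqrt(v) + eps)` that multiplies the sign direction.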

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 2 · Confidence 3

Strengths

- Studies Adam convergence from the perspective of **sign-gradient-based algorithms**.
- Builds on motivation from Balles and Hennig (2018) and related works, as noted on Page 3.

Weaknesses

- The measure $\frac{1}{T} \sum_{t=1}^{T-1} \mathbb{E}\|u_t \nabla F(x_t)\|_1$ in Theorem 2 could be meaningless, as $u_t^j$ could be very small in some cases.
- Conditions 1-3 and the lower-bound condition $\mathbb{E}|m_t^j| / \sqrt{v_t^j + \epsilon} \geq \bar{v} > 0$ are unnatural and strong. No existing convergence analysis for Adam in nonconvex smooth optimization involves these assumptions.
- The comparisons with state-of-the-art results in Table 1 are unfair because the authors do not acc

Reviewer 02 · Rating 2 · Confidence 4

Strengths

(a) The major contribution of this paper lies in the theoretical part, where the convergence rate for Adam is established under general weak assumptions that more realistically reflect practical training.
(b) The convergence rate is tight, and the result is novel to the best of my knowledge, in the context of general weak assumptions.
(c) The treatment of Adam as a sign-like gradient descent, although not proposed here for the first time, is still novel in the convergence analysis of Adam i

Weaknesses

My major concerns also lie in the theoretical part. The convergence rate clearly rests not only on weak assumptions such as generalized smoothness and affine noise, but also heavily on Conditions 1-3. These conditions, however, do not seem natural or reasonable. Specifically:
- Condition 1: Although the authors propose Proposition 1 to illustrate this condition, the assumption that $\|\nabla F(x_t)\|_2$ decreases as $O(1/t^{\alpha})$ is not so reasonable

Reviewer 03 · Rating 2 · Confidence 4

Strengths

1. The problem considered is important. The convergence of Adam is a long-standing area of research that is very relevant to deep learning in general.
2. The paper is well-written and clear about its stated contributions.

Weaknesses

1. The reliance on the additional conditions (Conditions 1-3) is a significant weakness for several reasons. At best these conditions obscure the results, and at worst the conditions are simply not valid.
   a. I did not find any convincing evidence for Condition 1, as numerical validation is only provided for Conditions 2 and 3. This is odd considering that Condition 1 seems like it would be quite simple to verify/disprove numerically. Proposition 1 is intended to provide theoretical support

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications