Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity

Shuo Xie; Mohamad Amin Mohamadi; Zhiyuan Li

arXiv:2410.08198 · cs.LG · June 12, 2025

TL;DR

This paper reveals that Adam's advantage over SGD in training language models stems from exploiting the $\ell_\infty$-geometry of the loss landscape, supported by new convergence analysis and empirical evidence.

Contribution

It introduces a novel convergence analysis of Adam based on $\ell_\infty$-geometry assumptions, explaining its empirical success over SGD.
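
For orientation, the textbook notion of smoothness with respect to the $\ell_\infty$ norm can be written out as below. This is the standard definition, not a quotation of the paper's assumption, which may be stated in a finer, coordinate-wise form; the constant $H$ is a placeholder symbol introduced here.

$$
\|\nabla L(x) - \nabla L(y)\|_1 \le H \, \|x - y\|_\infty \quad \text{for all } x, y.
$$

The $\ell_1$ norm appears on the gradient side because it is dual to $\ell_\infty$; the usual $\ell_2$ version has the $\ell_2$ norm on both sides. For twice-differentiable $L$, the bound $\sup_x \|\nabla^2 L(x)\|_{1,1} \le H$ is sufficient, since any matrix $B$ satisfies $\|Bu\|_1 \le \sum_{i,j} |B_{ij}| \, \|u\|_\infty = \|B\|_{1,1} \, \|u\|_\infty$.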

Findings

1. Adam performs better under $\ell_\infty$-geometry assumptions.
2. Changing $\ell_\infty$-geometry reduces Adam's effectiveness.
3. SGD remains unaffected by $\ell_\infty$-geometry changes.

Abstract

Adam outperforms SGD when training language models. Yet this advantage is not well understood theoretically: previous convergence analysis for Adam and SGD mainly focuses on the number of steps $T$ and is already minimax-optimal in non-convex cases, with both rates being $O(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel…
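
The claim that changing the geometry affects Adam but not SGD can be illustrated on a toy problem. The sketch below is not the authors' code (their official repository is in JAX) and does not reproduce their experiments. It assumes a hypothetical 2-D quadratic with Hessian diag(1, 100), plain gradient descent standing in for SGD, a textbook Adam update, and arbitrary learning rates; all function and constant names are made up for illustration. Each optimizer is run on the transformed loss $y \mapsto L(Qy)$ for an orthogonal $Q$, starting from $Q^\top x_0$.

```python
import math

A_DIAG = [1.0, 100.0]  # hypothetical ill-conditioned diagonal Hessian


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def loss(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]


def transformed(Q):
    """Loss and gradient of y -> loss(Q y) for an orthogonal matrix Q."""
    Qt = transpose(Q)
    return (lambda y: loss(matvec(Q, y)),
            lambda y: matvec(Qt, grad(matvec(Q, y))))


def run_gd(grad_fn, x0, lr, steps):
    x = list(x0)
    for _ in range(steps):
        x = [xi - lr * gi for xi, gi in zip(x, grad_fn(x))]
    return x


def run_adam(grad_fn, x0, lr, steps, b1=0.9, b2=0.999, eps=1e-8):
    """Textbook Adam with bias correction; every operation is per coordinate."""
    x = list(x0)
    m = [0.0] * len(x)
    v = [0.0] * len(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m, g)]
        v = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v, g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m]
        v_hat = [vi / (1 - b2 ** t) for vi in v]
        x = [xi - lr * mh / (math.sqrt(vh) + eps)
             for xi, mh, vh in zip(x, m_hat, v_hat)]
    return x


def final_losses(Q, x0, steps):
    """Final (GD, Adam) losses on y -> loss(Q y), started from y0 = Q^T x0."""
    loss_q, grad_q = transformed(Q)
    y0 = matvec(transpose(Q), x0)
    return (loss_q(run_gd(grad_q, y0, 0.005, steps)),
            loss_q(run_adam(grad_q, y0, 0.1, steps)))


c = math.sqrt(0.5)
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]   # permutation of the two coordinates
ROT45 = [[c, -c], [c, c]]         # rotation by 45 degrees

for name, Q in [("identity", IDENTITY), ("swap", SWAP), ("rot45", ROT45)]:
    gd_loss, adam_loss = final_losses(Q, [1.0, 1.0], steps=20)
    print(f"{name:8s}  GD loss = {gd_loss:.6f}   Adam loss = {adam_loss:.6f}")
```

In this toy, gradient descent reaches the same loss under all three transformations, and Adam reaches the same loss under the coordinate swap, but Adam's trajectory changes under the rotation (its very first step already lands on a different loss value). Whether the rotation helps or hurts here depends on the chosen step count and learning rate; the sketch only shows that Adam's behaviour depends on the coordinate system while gradient descent's does not.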

Peer Reviews

Decision · ICLR 2025 Spotlight

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The paper's insight that Adam is permutation-invariant but not rotation-invariant, while SGD is rotation-invariant, is crucial. I believe this property is highly related to the performance difference between Adam and SGD.
2. Based on $\ell_\infty$ smoothness measures, the paper provides a general framework to analyze Adam and its blockwise variants such as Adam-mini and Adalayer. This contribution is timely, given the increasing interest in blockwise optimization approaches aimed at…

Weaknesses

While the paper provides good insights into Adam's sensitivity to $\ell_\infty$ geometry, the proposed theorems using the $\|\cdot\|_{1,1}$ norm may not fully capture this sensitivity, particularly in explaining the performance gap between SGD and Adam. Two specific concerns are as follows:

- For convex problems with a positive semi-definite Hessian $B$, it holds that $$\|B\|_{1,1} \geq \mathrm{trace}(B) \geq \|B\|_2.$$ Thus, for a wide class of problems, we have $\sup_x \|B\|_2$ sma…
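
The inequality chain quoted by the reviewer can be checked numerically on small examples. The sketch below is illustrative only: the two matrices are hypothetical, the helper names are made up here, and the spectral norm is estimated by plain power iteration from a uniform start vector (which would fail for a matrix whose top eigenvector is orthogonal to that vector).

```python
import math


def norm_11(B):
    """Entrywise (1,1)-norm: sum of absolute values of all entries."""
    return sum(abs(b) for row in B for b in row)


def trace(B):
    return sum(B[i][i] for i in range(len(B)))


def spectral_norm_psd(B, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    n = len(B)
    v = [1.0 / math.sqrt(n)] * n  # uniform start vector (an assumption)
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    Bv = [sum(B[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(vi * bi for vi, bi in zip(v, Bv))  # Rayleigh quotient


# Two hypothetical PSD matrices: a diagonal one and a dense rank-one one.
DIAG = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
ONES = [[1.0] * 4 for _ in range(4)]

for name, B in [("diag(1,2,3)", DIAG), ("4x4 all-ones", ONES)]:
    print(f"{name:13s} ||B||_11 = {norm_11(B):.3f}  "
          f"trace = {trace(B):.3f}  ||B||_2 = {spectral_norm_psd(B):.3f}")
```

The two examples sit at opposite ends of the chain: for a diagonal PSD matrix the (1,1)-norm equals the trace, while for the rank-one all-ones matrix the trace equals the spectral norm and the (1,1)-norm is larger by a factor of the dimension.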

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. Although the paper could be more polished (see Question 1 for a comprehensive list of needed corrections), it is overall very well written and interesting. I carefully read the main text and the appendices, found the proofs very clear, and did not find mathematical errors.
2. The authors introduce what seems to me to be a novel framework to better capture Adam's coordinate-wise adaptivity, namely the $\ell_\infty$ geometry, which allows them to get tighter convergence bounds than previou…

Weaknesses

1. The convergence rate improvements seem a bit incremental when compared to (Défossez et al., 2022) and might not translate into practical gains.
2. The use of a non-standard $\ell_\infty$ smoothness assumption may limit the generalizability of the proposed results.
3. The empirical analysis of rotation sensitivity is very interesting but a bit limited in scope. The effort of including a ResNet-18 to explore a different kind of architecture is commendable, but a more diverse set of architectures wou…

Reviewer 03 · Rating 8 · Confidence 4

Strengths

1. The authors propose a unified algorithm that contains Adam, AdaSGD, and blockwise Adam.
2. The authors propose a more detailed assumption on the Lipschitz continuity of the gradient to characterize the underlying function carefully. Thus, they can give a tighter bound than one based on the overall Lipschitz constant.
3. The authors find that the (1,1)-norm of the Hessian is positively related to the performance of Adam in both theoretical analysis and experimental validation.

Weaknesses

1. From my point of view, the theorem in Section 3.3 already covers the results in Section 3.2, making Section 3.2 meaningless.
2. The authors claim that their proof shows why Adam can be better than SGD, but the only explanation given is $\sup_x \|\nabla^2 L(x)\|_{1,1} \leq \sup_x \|\nabla^2 L(x)\|_2$. Some reasonable examples should be provided.
3. In Table 1, since the convergence of AdaSGD is related to the 2-norm instead of the (1,1)-norm, why do the aut…

Code & Models

Repositories

mohamad-amin/adam-coordinate-adaptivity · JAX · Official

Videos

Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity · SlidesLive

Taxonomy

Topics: Computational Geometry and Mesh Generation · Markov Chains and Monte Carlo Methods · 3D Shape Modeling and Analysis

Methods: Residual Connection · Attention Dropout · Attention Is All You Need · Discriminative Fine-Tuning · Linear Layer · Weight Decay · Convolution · Cosine Annealing · Dropout