Adam Exploits $\ell_\infty$-geometry of Loss Landscape via Coordinate-wise Adaptivity
Shuo Xie, Mohamad Amin Mohamadi, Zhiyuan Li

TL;DR
This paper reveals that Adam's advantage over SGD in training language models stems from exploiting the $\ell_\infty$-geometry of the loss landscape, supported by a new convergence analysis and empirical evidence.
Contribution
It introduces a novel convergence analysis of Adam based on $\ell_\infty$-geometry assumptions, explaining its empirical success over SGD.
Findings
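Adam's convergence is analyzed under the assumption that the loss is smooth in $\ell_\infty$-geometry, which yields a much better empirical smoothness constant than the usual $\ell_2$ assumption for GPT-2 and ResNet models.
Changing the favorable $\ell_\infty$-geometry (for example, by rotating the parameter space) makes Adam perform much worse.
SGD provably remains unaffected by such changes, because it is rotation-invariant.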
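The contrast in the last two findings can be illustrated on a toy problem. The sketch below is not the paper's experiment: it assumes a hypothetical diagonal quadratic loss $L(x) = \tfrac12 x^\top H x$, a random orthogonal matrix `R`, and full-batch gradients, and it compares each optimizer on the original loss and on the rotated loss $y \mapsto L(Ry)$ started from the same point expressed in rotated coordinates. The helper `run` and all hyperparameters are illustrative choices.

```python
import numpy as np

def run(optimizer, H, x0, steps=200, lr=1e-2):
    """Minimise L(x) = 0.5 * x^T H x with full-batch gradients; return the final loss."""
    x = x0.copy()
    m = np.zeros_like(x)          # Adam first-moment estimate
    v = np.zeros_like(x)          # Adam second-moment estimate (one per coordinate)
    b1, b2, eps = 0.9, 0.999, 1e-8
    for t in range(1, steps + 1):
        g = H @ x                 # gradient of the quadratic
        if optimizer == "sgd":
            x = x - lr * g
        else:                     # "adam": coordinate-wise normalised step
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * g**2
            m_hat = m / (1 - b1**t)
            v_hat = v / (1 - b2**t)
            x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return 0.5 * x @ H @ x

rng = np.random.default_rng(0)
d = 20
H = np.diag(np.logspace(-2, 2, d))                   # axis-aligned curvature (assumed toy loss)
R, _ = np.linalg.qr(rng.standard_normal((d, d)))     # random orthogonal matrix
x0 = rng.standard_normal(d)

H_rot = R.T @ H @ R     # Hessian of the rotated loss y -> L(R y)
y0 = R.T @ x0           # the same starting point in rotated coordinates

sgd_plain, sgd_rot = run("sgd", H, x0), run("sgd", H_rot, y0)
adam_plain, adam_rot = run("adam", H, x0), run("adam", H_rot, y0)

print("SGD :", sgd_plain, sgd_rot)
print("Adam:", adam_plain, adam_rot)
```

For gradient descent the rotated iterates are exactly the rotated images of the original ones, so the two losses agree up to floating-point error. Adam's per-coordinate normalisation depends on the basis, so its two runs follow different trajectories.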
Abstract
Adam outperforms SGD when training language models. Yet this advantage is not well-understood theoretically: previous convergence analysis for Adam and SGD mainly focuses on the number of steps $T$ and is already minimax-optimal in non-convex cases, which are both $\widetilde{O}(T^{-1/4})$. In this work, we argue that the exploitation of nice $\ell_\infty$-geometry is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under $\ell_\infty$-geometry rather than the more common $\ell_2$-geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Our experiments confirm that Adam performs much worse when the favorable $\ell_\infty$-geometry is changed while SGD provably remains unaffected. We also extend the convergence analysis to blockwise Adam under novel…
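To make the blockwise Adam mentioned in the abstract concrete, here is a minimal sketch of one update step. It assumes that each block shares a single second-moment estimate built from the mean squared gradient within the block; the function name `blockwise_adam_step`, its arguments, and the default hyperparameters are hypothetical and may differ from the paper's exact algorithm. Under this assumption, blocks of size one give the usual coordinate-wise Adam, and a single block covering all coordinates gives one global adaptive step size, as in AdaSGD.

```python
import numpy as np

def blockwise_adam_step(x, g, m, v, t, blocks, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One blockwise Adam step (illustrative sketch, not the paper's reference code).

    blocks: list of integer index arrays that partition range(len(x)).
    m:      first-moment estimate, same shape as x.
    v:      second-moment estimates, one scalar per block.
    t:      1-based step counter used for bias correction.
    """
    m = b1 * m + (1 - b1) * g
    v = v.copy()
    for j, idx in enumerate(blocks):
        # Assumption: the block statistic is the mean squared gradient in the block.
        v[j] = b2 * v[j] + (1 - b2) * np.mean(g[idx] ** 2)
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    x = x.copy()
    for j, idx in enumerate(blocks):
        # Every coordinate in a block is scaled by the same denominator.
        x[idx] -= lr * m_hat[idx] / (np.sqrt(v_hat[j]) + eps)
    return x, m, v

d = 6
g = np.array([1.0, -2.0, 0.5, 4.0, -0.25, 3.0])
x = np.zeros(d)

per_coordinate = [np.array([i]) for i in range(d)]   # recovers standard Adam
one_block = [np.arange(d)]                           # one shared step size for all coordinates

x_adam, _, _ = blockwise_adam_step(x, g, np.zeros(d), np.zeros(d), 1, per_coordinate)
x_adasgd, _, _ = blockwise_adam_step(x, g, np.zeros(d), np.zeros(1), 1, one_block)
```

On the first step, the per-coordinate variant moves every coordinate by roughly `lr` in the direction of `-sign(g)`, while the single-block variant keeps the relative magnitudes of the gradient entries.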
Peer Reviews
Decision·ICLR 2025 Spotlight
1. The paper draws the crucial insight that Adam is permutation-invariant but not rotation-invariant, while SGD is rotation-invariant. I believe this property is highly related to the performance difference between Adam and SGD. 2. Based on $\ell_\infty$ smoothness measures, the paper provides a general framework to analyze Adam and its blockwise variants such as Adam-mini and Adalayer. This contribution is timely, given the increasing interest in blockwise optimization approaches aimed at
While the paper provides good insights into Adam's sensitivity to $\ell_\infty$ geometry, the proposed theorems using the $||\cdot||_{1,1}$ norm may not fully capture this sensitivity, particularly in explaining the performance gap between SGD and Adam. Two specific concerns are as follows: - For convex problems with a positive semi-definite Hessian $B$, it holds that: $$ ||B||_{1,1} \geq \mathrm{trace}(B) \geq ||B||_2 $$ Thus, for a wide class of problems, we have $\sup_x ||B||_2$ sma
1. Although the paper could be more polished (see Question 1 for a comprehensive list of needed corrections), it is overall very well written and interesting. I carefully read the main text and the appendices, found the proofs very clear, and did not find mathematical errors. 2. The authors introduce what seems to me to be a novel framework to better capture Adam's coordinate-wise adaptivity, namely the $\ell_\infty$ geometry that allows them to get tighter convergence bounds than previou
1. The convergence rate improvements seem a bit incremental when compared to (Défossez et al., 2022) and might not translate into practical gains. 2. The use of a non-standard $\ell_\infty$ smoothness assumption may limit the generalizability of the proposed results. 3. The empirical analysis of rotation sensitivity is very interesting but a bit limited in scope. The effort of including a ResNet-18 to explore a different kind of architecture is commendable but a more diverse set of architectures wou
1. The authors propose a unified algorithm that contains Adam, AdaSGD, and blockwise Adam. 2. The authors propose a finer-grained gradient-Lipschitz assumption that characterizes the underlying function more carefully. Thus, they can give a tighter bound than one based on a single overall Lipschitz constant. 3. The authors find that the (1,1)-norm of the Hessian is positively related to the performance of Adam in both theoretical analysis and experimental validation.
1. From my point of view the theorem in section 3.3 has already covered the results in section 3.2, making section 3.2 meaningless. 2. The authors claim that in their proof, we can see the reason that Adam can be better than SGD, while the explanation of the results is only given by $\sup_x ||\nabla^2 L(x)||_{1,1} \leq \sup_x ||\nabla^2 L(x)||_2$. It should have some reasonable examples. 3. In Table 1, since the convergence of AdaSGD is related to 2-norm instead of (1,1)-norm, why do the aut
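The norm chain quoted in one of the reviews above, $||B||_{1,1} \geq \mathrm{trace}(B) \geq ||B||_2$ for positive semi-definite $B$, can be checked numerically. The sketch below assumes that $||B||_{1,1}$ denotes the sum of the absolute values of all entries and that $||B||_2$ is the spectral norm; the helper name `hessian_norms` and the random test matrix are hypothetical. The first inequality holds because the diagonal of a PSD matrix is non-negative, and the second because the trace is the sum of the non-negative eigenvalues.

```python
import numpy as np

def hessian_norms(B):
    """Return (entrywise (1,1)-norm, trace, spectral norm) of a square matrix B."""
    norm_11 = np.abs(B).sum()          # assumed definition: sum of |B_ij|
    trace = np.trace(B)
    spectral = np.linalg.norm(B, 2)    # largest singular value
    return norm_11, trace, spectral

rng = np.random.default_rng(0)
d = 50
A = rng.standard_normal((d, d))
B_dense = A @ A.T                               # random dense PSD matrix
B_diag = np.diag(rng.uniform(0.0, 1.0, d))      # PSD matrix with no off-diagonal mass

for name, B in [("dense", B_dense), ("diagonal", B_diag)]:
    n11, tr, spec = hessian_norms(B)
    print(f"{name:8s}  (1,1)-norm={n11:.3f}  trace={tr:.3f}  spectral={spec:.3f}")
```

For the diagonal matrix the (1,1)-norm equals the trace, while off-diagonal entries in the dense matrix push the (1,1)-norm above the trace.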
Taxonomy
Topics: Computational Geometry and Mesh Generation · Markov Chains and Monte Carlo Methods · 3D Shape Modeling and Analysis
Methods: Residual Connection · Attention Dropout · Attention Is All You Need · Discriminative Fine-Tuning · Linear Layer · Weight Decay · Convolution · Cosine Annealing · Dropout
