Understanding Adam Requires Better Rotation Dependent Assumptions
Tianyue H. Zhang, Lucas Maes, Alan Milligan, Alexia Jolicoeur-Martineau, Ioannis Mitliagkas, Damien Scieur, Simon Lacoste-Julien, Charles Guille-Escuret

TL;DR
This paper studies the Adam optimizer's sensitivity to rotations of the parameter space. It shows that Adam's empirical success depends on the choice of basis and argues that theoretical analyses of Adam need rotation-dependent assumptions.
Contribution
The paper demonstrates Adam's rotation sensitivity, shows that rotation-invariant assumptions cannot account for Adam's advantage, and proposes orthogonality of updates as a key factor for future theories.
Findings
Adam's performance in training transformers degrades under random rotations of the parameter space
Structured rotations can preserve or improve Adam's performance
Orthogonality of updates correlates with basis sensitivity
Abstract
Despite its widespread adoption, Adam's advantage over Stochastic Gradient Descent (SGD) lacks a comprehensive theoretical explanation. This paper investigates Adam's sensitivity to rotations of the parameter space. We observe that Adam's performance in training transformers degrades under random rotations of the parameter space, indicating a crucial sensitivity to the choice of basis in practice. This reveals that conventional rotation-invariant assumptions are insufficient to capture Adam's advantages theoretically. To better understand the rotation-dependent properties that benefit Adam, we also identify structured rotations that preserve or even enhance its empirical performance. We then examine the rotation-dependent assumptions in the literature and find that they fall short in explaining Adam's behaviour across various rotation types. In contrast, we verify the orthogonality of…
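The idea of running an optimizer in a rotated parameter space can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experimental setup: it uses a small ill-conditioned quadratic instead of a transformer, a hand-written Adam update, and a random orthogonal matrix (a rotation, possibly combined with a reflection). The function names, dimensions, step counts, and learning rates are all assumptions chosen for illustration. Plain gradient descent is included as a reference because its trajectory is unchanged by an orthogonal change of basis, whereas Adam's coordinate-wise scaling is not.

```python
import numpy as np


def random_orthogonal(dim, rng):
    """Sample a random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flipping column signs keeps q orthogonal and removes the QR sign ambiguity.
    return q * np.sign(np.diag(r))


def final_loss(optimizer, h, rotation, w0, steps, lr,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimise g(v) = f(rotation @ v), where f(w) = 0.5 * sum(h * w**2)."""
    v = rotation.T @ w0          # same starting point, in rotated coordinates
    m = np.zeros_like(v)         # Adam first-moment estimate
    s = np.zeros_like(v)         # Adam second-moment estimate
    for t in range(1, steps + 1):
        grad = rotation.T @ (h * (rotation @ v))  # chain rule for g
        if optimizer == "gd":
            v = v - lr * grad
        elif optimizer == "adam":
            m = beta1 * m + (1 - beta1) * grad
            s = beta2 * s + (1 - beta2) * grad**2
            m_hat = m / (1 - beta1**t)            # bias correction
            s_hat = s / (1 - beta2**t)
            v = v - lr * m_hat / (np.sqrt(s_hat) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    w = rotation @ v             # map back to the original coordinates
    return float(0.5 * np.sum(h * w**2))


def compare(dim=20, steps=200, seed=0):
    """Final losses for GD and Adam, with and without a random rotation."""
    rng = np.random.default_rng(seed)
    h = np.logspace(0, 3, dim)   # axis-aligned curvatures (hypothetical choice)
    w0 = np.ones(dim)
    identity = np.eye(dim)
    rotation = random_orthogonal(dim, rng)
    return {
        "gd_original": final_loss("gd", h, identity, w0, steps, lr=5e-4),
        "gd_rotated": final_loss("gd", h, rotation, w0, steps, lr=5e-4),
        "adam_original": final_loss("adam", h, identity, w0, steps, lr=1e-2),
        "adam_rotated": final_loss("adam", h, rotation, w0, steps, lr=1e-2),
    }


if __name__ == "__main__":
    for name, loss in compare().items():
        print(f"{name}: {loss:.6g}")
```

In this sketch the two gradient descent losses agree up to floating-point error, while the two Adam losses generally differ. Which basis Adam prefers on this toy problem says nothing by itself about transformers; the example only shows what "rotating the parameter space" means operationally.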
Taxonomy
Topics: Design Education and Practice
Methods: Adam
