Why Adam Works Better with $\beta_1 = \beta_2$: The Missing Gradient Scale Invariance Principle
Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

TL;DR
This paper explains why the Adam optimizer performs better when $\beta_1 = \beta_2$, linking this choice to a property called gradient scale invariance and supporting the link with theoretical proofs and experiments.
Contribution
It formalizes gradient scale invariance and proves Adam achieves this property only when $\beta_1 = \beta_2$, guiding future optimizer design.
Findings
Adam becomes gradient scale invariant of first order if and only if $\beta_1 = \beta_2$.
Experiments show smoother effects of gradient rescaling when $\beta_1 = \beta_2$.
The theory aligns with improved training behavior across vision and language tasks.
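To make the quantities in these findings concrete, here is a minimal sketch of the standard Adam update on a stream of scalar gradients, written with NumPy. It is not the authors' code and does not reproduce their definition of first-order gradient scale invariance, which is not given in this summary. The helper names `adam_updates` and `rescaling_gap`, the synthetic gradients, and the step-dependent `scales` schedule are all assumptions made for illustration. The first check shows only the classical fact that, with `eps = 0`, multiplying every gradient by the same constant leaves the Adam update unchanged for any choice of betas. The second part is a hypothetical probe for comparing balanced and unbalanced settings under a rescaling that changes mid-run.

```python
import numpy as np


def adam_updates(grads, beta1, beta2, lr=1e-3, eps=0.0):
    """Return the Adam update at every step for a stream of scalar gradients."""
    m, v = 0.0, 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        # Exponential moving averages of the gradient and its square.
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Standard bias correction.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        updates.append(-lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)


# Synthetic gradients (an assumption; any nonzero sequence would do).
rng = np.random.default_rng(0)
grads = rng.normal(size=200)

# Classical check: a constant rescaling cancels in m_hat / sqrt(v_hat).
base = adam_updates(grads, 0.95, 0.95)
scaled = adam_updates(1000.0 * grads, 0.95, 0.95)
const_gap = float(np.max(np.abs(base - scaled)))

# Hypothetical probe: the gradient scale jumps by 10x halfway through.
scales = np.where(np.arange(grads.size) < 100, 1.0, 10.0)


def rescaling_gap(beta1, beta2):
    """Largest change in the update caused by the step-dependent rescaling."""
    plain = adam_updates(grads, beta1, beta2)
    rescaled = adam_updates(scales * grads, beta1, beta2)
    return float(np.max(np.abs(plain - rescaled)))


gap_balanced = rescaling_gap(0.95, 0.95)   # beta1 == beta2
gap_default = rescaling_gap(0.9, 0.999)    # common default setting

print(f"constant rescaling gap: {const_gap:.2e}")
print(f"step-dependent gap, balanced betas: {gap_balanced:.2e}")
print(f"step-dependent gap, default betas:  {gap_default:.2e}")
```

The probe is deliberately small. Treat the printed gaps as a starting point for experimentation rather than as evidence for or against the paper's theorem, which concerns a specific first-order notion of invariance.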
Abstract
Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $\beta_1 = \beta_2$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as \textit{gradient scale invariance}. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $\beta_1 = \beta_2$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across…
