In Search of Adam's Secret Sauce
Antonio Orvieto, Robert M. Gower

TL;DR
This paper investigates why the Adam optimizer performs so well when training language models. It compares Adam with simplified variants through extensive experiments and identifies a specific configuration, equal momentum parameters (beta1 = beta2), that offers both strong performance and theoretical insights.
Contribution
The study provides a comprehensive empirical comparison of Adam and its variants, and introduces a constrained Adam variant with equal beta parameters that offers both practical benefits and theoretical understanding.
Findings
Signed momentum methods are faster than SGD but consistently underperform Adam, even after careful tuning.
Constraining beta1 = beta2 in Adam preserves near-optimal performance and allows for new theoretical reformulations.
Adam can be interpreted as an online mean and variance estimator from a variational inference perspective.
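The mean and variance reading of the beta1 = beta2 case can be checked numerically. The sketch below is an illustration written for this summary, not the authors' code: it runs scalar Adam-style moving averages with a single shared `beta`, omits bias correction for simplicity, and tracks a hypothetical extra quantity `s`, an exponentially weighted estimate of the squared deviation of each gradient from the running mean. Expanding the recursions shows that `v = m**2 + s` holds at every step when the two betas are equal, so the step `m / sqrt(v)` can be read as mean / sqrt(mean^2 + variance). The function name, the default values, and the synthetic Gaussian gradients are all assumptions.

```python
import math
import random


def adam_equal_betas(grads, beta=0.95, lr=1e-3, eps=1e-8):
    """Scalar Adam-style updates with beta1 = beta2 = beta (no bias correction).

    Returns the list of parameter steps and the final (m, v, s), where
    m is the running mean, v the running second moment, and s an online
    variance estimate (illustrative name, not from the paper).
    """
    m = v = s = 0.0
    steps = []
    for g in grads:
        # Variance estimate uses the deviation from the *previous* mean.
        s = beta * s + beta * (1.0 - beta) * (g - m) ** 2
        m = beta * m + (1.0 - beta) * g        # first-moment moving average
        v = beta * v + (1.0 - beta) * g * g    # second-moment moving average
        steps.append(-lr * m / (math.sqrt(v) + eps))
    return steps, m, v, s


random.seed(0)
grads = [random.gauss(0.5, 2.0) for _ in range(200)]  # synthetic noisy gradients
steps, m, v, s = adam_equal_betas(grads)

# With equal betas the second moment splits into mean^2 plus variance.
print(math.isclose(v, m * m + s, rel_tol=1e-9))
```

Because `s` is non-negative, `m**2 <= v`, so each step has magnitude at most `lr`; the noisier the gradients are relative to their mean, the smaller the step.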
Abstract
Understanding the remarkable efficacy of Adam when training transformer-based language models has become a central research topic within the optimization community. To gain deeper insights, several simplifications of Adam have been proposed, such as the signed gradient and signed momentum methods. In this work, we conduct an extensive empirical study - training over 1500 language models across different data configurations and scales - comparing Adam to several known simplified variants. We find that signed momentum methods are faster than SGD, but consistently underperform relative to Adam, even after careful tuning of momentum, clipping setting and learning rates. However, our analysis reveals a compelling option that preserves near-optimal performance while allowing for new insightful reformulations: constraining the Adam momentum parameters to be equal, beta1 = beta2. Beyond robust…
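For readers unfamiliar with the signed momentum baseline named in the abstract, the following is a minimal sketch of one common form of that update, assuming a plain exponential moving average of the gradient followed by a sign. The function names, the list-of-floats parameter layout, and the default `lr` and `beta` are illustrative choices, and the clipping mentioned in the abstract is not modelled here.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def signum_step(params, grads, momentum, lr=1e-3, beta=0.9):
    """One signed-momentum step on lists of floats.

    The momentum buffer is an exponential moving average of the gradients;
    only its sign is used, so every coordinate moves by exactly lr (or not
    at all when its momentum is zero).
    """
    new_momentum = [beta * m + (1.0 - beta) * g for m, g in zip(momentum, grads)]
    new_params = [p - lr * sign(m) for p, m in zip(params, new_momentum)]
    return new_params, new_momentum


# Example: three parameters with gradients of very different magnitude.
params = [1.0, -2.0, 0.5]
grads = [0.3, -700.0, 0.0]
momentum = [0.0, 0.0, 0.0]
new_params, new_momentum = signum_step(params, grads, momentum)
print(new_params)
```

Unlike Adam, this update discards the magnitude of the momentum entirely, so the step size does not shrink when gradients are noisy.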
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Stochastic Gradient Optimization Techniques · Topic Modeling
