The AdEMAMix Optimizer: Better, Faster, Older

Matteo Pagliardini; Pierre Ablin; David Grangier

arXiv:2409.03137 · cs.LG · October 1, 2024

TL;DR

The paper introduces AdEMAMix, a modified optimizer combining two EMAs to better utilize past gradients, leading to faster convergence, lower minima, and reduced forgetting in training large models.

Contribution

It proposes AdEMAMix, a novel optimizer that improves gradient accumulation by mixing two EMAs, addressing limitations of traditional single EMA methods.
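The single-EMA limitation can be made concrete with a little arithmetic. An EMA with decay beta gives the gradient from k steps ago a weight of (1 - beta) * beta^k. The sketch below compares a fast and a slow EMA and then a weighted sum of the two. The decay values (0.9 and 0.9999), the mixing coefficient alpha = 5.0, and the function names are illustrative assumptions, not values taken from this page.

```python
def ema_weight(beta: float, age: int) -> float:
    """Weight an EMA with decay `beta` puts on the gradient from `age` steps ago."""
    return (1.0 - beta) * beta ** age


def mixture_weight(age: int, beta_fast: float = 0.9,
                   beta_slow: float = 0.9999, alpha: float = 5.0) -> float:
    """Weight under a hypothetical mixture: fast EMA plus alpha times a slow EMA."""
    return ema_weight(beta_fast, age) + alpha * ema_weight(beta_slow, age)


if __name__ == "__main__":
    for age in (0, 100, 10_000):
        print(
            f"age={age:>6}: "
            f"fast={ema_weight(0.9, age):.2e}  "
            f"slow={ema_weight(0.9999, age):.2e}  "
            f"mixture={mixture_weight(age):.2e}"
        )
```

Under these assumed values, the fast EMA weights the newest gradient heavily but has effectively forgotten anything 10,000 steps old, while the slow EMA keeps old gradients at the cost of a very small weight on the newest one. The mixture keeps both properties, which is the trade-off the contribution statement refers to.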

Findings

01

Gradients remain relevant for tens of thousands of steps.

02

AdEMAMix accelerates convergence and finds lower minima.

03

AdEMAMix reduces model forgetting during training.

Abstract

Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which decays exponentially the present contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that…
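To illustrate what "a simple modification of the Adam optimizer with a mixture of two EMAs" could look like, here is a minimal pure-Python sketch. It is not the authors' reference implementation: the function and field names, the constant hyperparameter values (beta3, alpha), the choice to bias-correct only the fast EMA, and the omission of any warmup schedule are all assumptions made for illustration. Consult the paper for the authoritative algorithm.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class MixState:
    """Optimizer state for the sketch (names are hypothetical)."""
    m_fast: List[float]   # fast EMA of gradients, as in Adam
    m_slow: List[float]   # additional slow EMA of gradients
    v: List[float]        # EMA of squared gradients, as in Adam
    step: int = 0


def init_state(n: int) -> MixState:
    return MixState([0.0] * n, [0.0] * n, [0.0] * n)


def ademamix_like_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999,
                       beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One Adam-style update whose numerator mixes a fast and a slow EMA."""
    state.step += 1
    bias1 = 1.0 - beta1 ** state.step   # Adam bias correction, first moment
    bias2 = 1.0 - beta2 ** state.step   # Adam bias correction, second moment
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        state.m_fast[i] = beta1 * state.m_fast[i] + (1.0 - beta1) * g
        state.m_slow[i] = beta3 * state.m_slow[i] + (1.0 - beta3) * g
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g
        m_hat = state.m_fast[i] / bias1
        v_hat = state.v[i] / bias2
        # Assumption: the slow EMA is added without bias correction, scaled by alpha.
        update = (m_hat + alpha * state.m_slow[i]) / (math.sqrt(v_hat) + eps)
        new_params.append(p - lr * (update + weight_decay * p))
    return new_params


# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
x = [3.0]
state = init_state(len(x))
for _ in range(100):
    x = ademamix_like_step(x, [2.0 * x[0]], state, lr=0.01)
print(x[0])
```

Setting alpha to 0 reduces this sketch to a plain Adam-style update, which makes it easy to compare the two behaviours on a toy problem.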

Peer Reviews

Decision · ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The performance of AdEMAMix is quite impressive. The writing is excellent. The ablation studies are thorough.

Weaknesses

The motivation is rather vague. See below.

Reviewer 02 · Rating 10 · Confidence 4

Strengths

These are strong results on a very important problem. They also provide many optimizer ablations in the Appendix showing the robustness of their proposed optimizer.

Weaknesses

Since many of the experiments use a small batch size, it would have been interesting to explore the effect of weight averaging. For example, does weight averaging help AdamW and AdEMAMix equally, or not?

Reviewer 03 · Rating 5 · Confidence 4

Strengths

- The paper is well written and easy to follow. The experiments on the Rosenbrock function are convincing. Some similar (loss landscape) models have been proposed recently to analyze learning rate schemes [1]. It may be interesting to draw some theoretical or experimental connections in the case of momentum.
- The experiments seem comprehensive, covering different settings and tasks. The improvement over the baseline is shown.

[1] Understanding Warmup-Stable-Decay Learning Rates: A River Valley Los

Weaknesses

- Some additional hyperparameters are introduced. Notably, $\alpha$ (which controls the mixing ratio) seems important for the algorithm. It would be important to study the sensitivity of the algorithm to this hyperparameter, given that it essentially controls the contribution of each momentum buffer to the current update.
- It would be better if some convergence guarantees for the algorithm could be provided, even in the convex setting. For example, what is the relationship bet

Code & Models

Videos

The AdEMAMix Optimizer: Better, Faster, Older · SlidesLive

Taxonomy

Topics: Hemodynamic Monitoring and Therapy · Fault Detection and Control Systems

Methods: AdamW · Adam · Adaptive EMA Mixture