TL;DR
The paper introduces AdEMAMix, a modified optimizer combining two EMAs to better utilize past gradients, leading to faster convergence, lower minima, and reduced forgetting in training large models.
Contribution
It proposes AdEMAMix, a novel optimizer that improves gradient accumulation by mixing two EMAs, addressing limitations of traditional single EMA methods.
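As a rough illustration of what "mixing two EMAs" inside Adam could look like, here is a minimal sketch of a single update step. The summary above does not give the update rule, so everything here is an assumption made for illustration: the function and state names (`ademamix_step`, `init_state`, `m_fast`, `m_slow`), the default hyperparameter values, and the choice to bias-correct only the fast EMA. Schedules and other details of the published method are omitted.

```python
import math

def init_state(n):
    """Zero-initialised optimizer state for n parameters (hypothetical layout)."""
    return {"t": 0, "m_fast": [0.0] * n, "m_slow": [0.0] * n, "v": [0.0] * n}

def ademamix_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                  beta3=0.9999, alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One Adam-style update that mixes a fast and a slow gradient EMA.

    theta and grad are lists of floats; returns the new parameter list.
    All default values are illustrative assumptions.
    """
    state["t"] += 1
    t = state["t"]
    new_theta = []
    for i, (p, g) in enumerate(zip(theta, grad)):
        # Fast EMA: tracks recent gradients, as in plain Adam.
        state["m_fast"][i] = beta1 * state["m_fast"][i] + (1 - beta1) * g
        # Slow EMA: decays much more slowly, so old gradients still count.
        state["m_slow"][i] = beta3 * state["m_slow"][i] + (1 - beta3) * g
        # Second-moment estimate, as in Adam.
        state["v"][i] = beta2 * state["v"][i] + (1 - beta2) * g * g
        m_hat = state["m_fast"][i] / (1 - beta1 ** t)  # bias-corrected fast EMA
        v_hat = state["v"][i] / (1 - beta2 ** t)       # bias-corrected second moment
        # alpha sets how much the slow EMA contributes to the update.
        update = (m_hat + alpha * state["m_slow"][i]) / (math.sqrt(v_hat) + eps)
        new_theta.append(p - lr * (update + weight_decay * p))
    return new_theta

# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
theta, state = [1.0], init_state(1)
for _ in range(100):
    theta = ademamix_step(theta, [2.0 * theta[0]], state, lr=0.01)
print(theta)
```

In this sketch, setting `alpha=0` removes the slow EMA and leaves an ordinary Adam-style step, which makes the two variants easy to compare side by side.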
Findings
Gradients remain relevant for tens of thousands of steps.
AdEMAMix accelerates convergence and finds lower minima.
Reduces model forgetting during training.
Abstract
Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which decays exponentially the present contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show -- quite surprisingly -- that…
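The abstract's claim that a single EMA cannot weight both the immediate past and older gradients can be checked with a little arithmetic. An EMA with decay β gives a gradient that is k steps old the weight (1 − β)·β^k. The sketch below compares two decay values, 0.9 and 0.9999, chosen here purely for illustration; the helper names `ema_weight` and `half_life` are hypothetical.

```python
import math

def ema_weight(beta, age):
    """Weight an EMA with decay `beta` assigns to a gradient `age` steps old."""
    return (1 - beta) * beta ** age

def half_life(beta):
    """Number of steps after which a gradient's weight has halved."""
    return math.log(0.5) / math.log(beta)

for beta in (0.9, 0.9999):
    print(
        f"beta={beta}: newest weight={ema_weight(beta, 0):.2e}, "
        f"weight at age 1000={ema_weight(beta, 1000):.2e}, "
        f"half-life={half_life(beta):.1f} steps"
    )
```

With β = 0.9 the newest gradient gets a weight of 0.1, but the half-life is under 7 steps, so a gradient 1,000 steps old contributes essentially nothing. With β = 0.9999 the half-life is roughly 6,900 steps, but the newest gradient gets a weight of only 10⁻⁴. Neither setting covers both ends, which is the gap a mixture of a fast and a slow EMA is meant to close.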
Peer Reviews
Decision: ICLR 2025 Poster
The performance of AdEMAMix is quite impressive. The writing is excellent. The ablation studies are thorough.
The motivation is rather vague. See below.
These are strong results on a very important problem. They also provide many optimizer ablations in the Appendix showing the robustness of their proposed optimizer.
Since many of the experiments are with small batch size it would have been interesting to explore the effect of weight averaging. For example, is it the case that weight averaging helps AdamW and AdEMAMix equally? Or not?
- The paper is well written and easy to follow. The experiments on the Rosenbrock function are convincing. Some similar loss-landscape models have been proposed recently to analyze learning rate schemes [1]. It may be interesting to draw theoretical or experimental connections in the case of momentum.
- The experiments seem comprehensive, covering different settings and tasks. An improvement over the baseline is shown.
[1] Understanding Warmup-Stable-Decay Learning Rates: A River Valley Los
- Some additional hyperparameters are introduced. Notably, $\alpha$, which controls the mixing ratio, seems important for the algorithm. It would be important to study the sensitivity of the algorithm to this hyperparameter, given that it essentially controls the contribution of each momentum buffer to the current update.
- It would be better if some convergence guarantees for the algorithm could be provided, even in the convex setting. For example, what is the relationship bet
Code & Models
- 🤗 primeline/whisper-large-v3-turbo-german · model · 2.7k dl · ♡ 55
- 🤗 cstr/whisper-large-v3-turbo-german-int8_float32 · model · 24 dl · ♡ 2
- 🤗 cstr/whisper-large-v3-turbo-german-ggml · model · ♡ 3
- 🤗 jimmymeister/whisper-large-v3-turbo-german-ct2 · model · 960 dl · ♡ 4
- 🤗 primeline/whisper-tiny-german-1224 · model · 301 dl · ♡ 15
- 🤗 EvgenyShivchenkoUIT/bw-voice_recog_de · model
- 🤗 EvgenyShivchenkoUIT/bw-voice_recog_de_turbo · model · 22 dl
Taxonomy
Topics: Hemodynamic Monitoring and Therapy · Fault Detection and Control Systems
Methods: AdamW · Adam · Adaptive EMA Mixture
