Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants
Depen Morwani, Nikhil Vyas, Hanlin Zhang, Sham Kakade

TL;DR
This paper links recent optimizers such as Schedule-Free, AdEMAMix, MARS, and Lion to theoretically accelerated SGD, finds in preliminary experiments that AdEMAMix performs best, and introduces Simplified-AdEMAMix, a variant that maintains that performance with less complexity.
Contribution
It establishes explicit theoretical connections between schedule-free optimizers and accelerated SGD, and proposes a simplified AdEMAMix variant with comparable performance.
Findings
AdEMAMix outperforms the other optimizers in preliminary experiments on a 150m language modeling task.
Simplified-AdEMAMix matches AdEMAMix performance across batch sizes.
Recent momentum-modifying optimizers are explicitly connected to accelerated SGD variants that decouple the momentum coefficient from the current gradient's weight.
Abstract
Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150m language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small…
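The idea of decoupling the momentum coefficient from the current gradient's weight can be illustrated with a minimal sketch. This is not the paper's algorithm or any library's API: the function names, the parameter `alpha`, and the toy quadratic objective are all assumptions made for illustration. In standard EMA momentum the current gradient enters the update only through the factor `1 - beta`; here it also receives its own independent weight `alpha`.

```python
def decoupled_momentum_step(theta, grad, m, lr=0.1, beta=0.9, alpha=1.0):
    """One hypothetical SGD step where the current gradient's weight (alpha)
    is chosen independently of the momentum coefficient (beta)."""
    # Exponential moving average of gradients, controlled by beta only.
    m = beta * m + (1.0 - beta) * grad
    # The current gradient gets a separate weight alpha in the update direction.
    # Setting alpha = 0 recovers plain EMA momentum.
    direction = alpha * grad + m
    theta = theta - lr * direction
    return theta, m


def minimize_quadratic(steps=200, lr=0.1, beta=0.9, alpha=1.0):
    """Toy example (an assumption, not from the paper): minimize
    f(theta) = 0.5 * theta**2, whose gradient is theta, starting at 5.0."""
    theta, m = 5.0, 0.0
    for _ in range(steps):
        grad = theta
        theta, m = decoupled_momentum_step(theta, grad, m, lr, beta, alpha)
    return theta


if __name__ == "__main__":
    print(minimize_quadratic())
```

In this sketch, `beta` sets how long past gradients are remembered, while `alpha` sets how strongly the newest gradient acts, so the two can be tuned separately.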
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Distributed and Parallel Computing Systems · Reservoir Engineering and Simulation Methods
Methods: Adaptive EMA Mixture · Evolved Sign Momentum
