Optimistic Dual Averaging Unifies Modern Optimizers
Thomas Pethick, Wanyun Xie, Roman Machacek, Volkan Cevher

TL;DR
This paper introduces SODA, a generalization of Optimistic Dual Averaging that unifies modern optimizers, together with a wrapper for any base optimizer that eliminates weight decay tuning through a theoretically grounded decay schedule and improves training performance.
Contribution
The paper presents SODA, a generalization of Optimistic Dual Averaging under which Muon, Lion, AdEMAMix, and NAdam can all be viewed as optimistic instances, and provides a practical wrapper for any base optimizer that improves performance without additional hyperparameter tuning.
Findings
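The SODA wrapper improves the performance of base optimizers across various scales and training horizons.
It eliminates the need for weight decay tuning by using a theoretically grounded decay schedule.
The improvements are consistent and require no additional hyperparameter tuning.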
Abstract
We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically-grounded decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
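For readers unfamiliar with the underlying idea, the following is a minimal sketch of textbook optimistic dual averaging on an unconstrained toy problem. It is not the paper's SODA update, wrapper, or decay schedule, none of which are specified in the abstract. The function names, the constant step size `eta`, the choice of the last gradient as the "optimistic" hint, and the quadratic objective are all assumptions made for illustration.

```python
def optimistic_dual_averaging(grad, x0, eta=0.1, steps=100):
    """Textbook optimistic dual averaging (illustrative, not SODA).

    Update: x_{t+1} = x0 - eta * (sum of gradients seen so far + hint),
    where the hint guesses the next gradient. Here the hint is assumed
    to be the most recent gradient, a common choice.
    """
    x = list(x0)
    grad_sum = [0.0] * len(x0)
    for _ in range(steps):
        g = grad(x)
        # Dual averaging accumulates gradients instead of moving from x_t.
        grad_sum = [s + gi for s, gi in zip(grad_sum, g)]
        # Optimistic step: add the hint g as a stand-in for the next gradient.
        x = [x0i - eta * (s + gi) for x0i, s, gi in zip(x0, grad_sum, g)]
    return x


def quadratic_grad(x):
    # Gradient of the toy objective f(x) = 0.5 * ||x||^2 (hypothetical example).
    return list(x)


x_final = optimistic_dual_averaging(quadratic_grad, x0=[1.0, -2.0], eta=0.1, steps=200)
```

On this toy quadratic the iterates contract toward the minimizer at the origin. Dropping the hint term `gi` from the update recovers plain dual averaging, which is one way to see what the "optimistic" correction adds.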
