Optimistic Dual Averaging Unifies Modern Optimizers

Thomas Pethick; Wanyun Xie; Roman Machacek; Volkan Cevher

arXiv:2605.11172 · cs.LG · May 13, 2026

TL;DR

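This paper introduces SODA, a generalization of Optimistic Dual Averaging that gives a common view of optimizers such as Muon, Lion, AdEMAMix and NAdam. From this view the authors derive a wrapper for any base optimizer that replaces weight decay tuning with a theoretically grounded $1/k$ decay schedule.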

Contribution

The paper presents SODA, a generalization of Optimistic Dual Averaging under which Muon, Lion, AdEMAMix and NAdam can all be viewed as optimistic instances. It also proposes a practical SODA wrapper for any base optimizer that improves performance without additional hyperparameter tuning.

Findings

01 SODA consistently improves performance across the scales and training horizons tested.

02 Its $1/k$ decay schedule eliminates the need for weight decay tuning.

03 The improvements require no additional hyperparameter tuning.

Abstract

We introduce SODA, a generalization of Optimistic Dual Averaging, which provides a common perspective on state-of-the-art optimizers like Muon, Lion, AdEMAMix and NAdam, showing that they can all be viewed as optimistic instances of this framework. Based on this framing, we propose a practical SODA wrapper for any base optimizer that eliminates weight decay tuning through a theoretically grounded $1/k$ decay schedule. Empirical results across various scales and training horizons show that SODA consistently improves performance without any additional hyperparameter tuning.
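
To make the idea of a $1/k$ decay schedule concrete, here is a minimal Python sketch. It is not the paper's SODA wrapper: the abstract states only that the schedule is $1/k$, so the function names (`decay_coefficient`, `wrapped_step`), the 1-indexed step counter, and the decoupled way the decay enters the update are all assumptions made for illustration.

```python
from typing import List


def decay_coefficient(k: int) -> float:
    """Hypothetical 1/k weight decay coefficient for the 1-indexed step k."""
    if k < 1:
        raise ValueError("step index k must be >= 1")
    return 1.0 / k


def wrapped_step(
    weights: List[float],
    base_update: List[float],
    k: int,
    lr: float,
) -> List[float]:
    """Apply one base-optimizer update plus a decoupled 1/k weight decay.

    `base_update` stands in for whatever direction the base optimizer
    produces. Adding the decay term to it in this decoupled form is an
    assumption, not something stated in the abstract.
    """
    lam = decay_coefficient(k)
    return [w - lr * (u + lam * w) for w, u in zip(weights, base_update)]


if __name__ == "__main__":
    w = [1.0, -2.0]
    for step in range(1, 4):
        # A zero base update isolates the effect of the decay schedule.
        w = wrapped_step(w, [0.0, 0.0], k=step, lr=0.1)
        print(step, decay_coefficient(step), w)
```

In this sketch the decay coefficient is fixed by the step counter alone, so there is no separate weight decay hyperparameter to tune. That is the property the abstract attributes to the schedule.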
