From SGD to Muon: Adaptive Optimization via Schatten-p Norms
Thomas Massena (IRIT, DTIPG - SNCF, UT3), Corentin Friedrich, Mathieu Serrurier (IRIT)

TL;DR
This paper introduces an adaptive optimizer for deep neural networks that dynamically selects a matrix geometry for each layer from gradient and activation statistics, matching or outperforming fixed-geometry optimizers like Muon and AdamW.
Contribution
The authors propose a novel data-driven criterion for dynamically choosing Linear Minimization Oracle (LMO) geometries in optimizers. The resulting framework recovers SGD, Muon, Adam, and MuAdam as special cases and improves performance with minimal overhead.
Findings
The adaptive optimizer outperforms or matches the best fixed-geometry optimizers across multiple scenarios.
The method adds only about 3% runtime overhead.
It demonstrates the feasibility of data-driven geometry adaptation in optimization.
Abstract
Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by design or empirically, which are not necessarily optimal for the problem's geometry. We introduce a novel, efficient data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable,…
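To make the "SGD to Muon" design space concrete, here is a minimal sketch of an LMO over the unit Schatten-p ball, using the standard Hölder-duality closed form. It is not the authors' criterion or implementation: the function name `schatten_lmo` is hypothetical, the exact SVD is an assumption made for clarity (Muon itself approximates the orthogonalization with Newton-Schulz iterations), and the paper's own parametrization of the interpolation may differ. With p = 2 the update is the Frobenius-normalized gradient (normalized SGD); with p = ∞ every singular value of the update is set to 1 (a Muon-style orthogonalized step).

```python
import numpy as np

def schatten_lmo(grad, p):
    """Sketch: argmin of <grad, X> over {X : ||X||_{S_p} <= 1}, for p in (1, inf].

    Hypothetical helper for illustration only, not the paper's method.
    """
    # Thin SVD: grad = u @ diag(s) @ vt, with s sorted in descending order.
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    if s[0] == 0.0:
        # Zero gradient: every feasible X is optimal, so return the zero update.
        return np.zeros_like(grad)
    if np.isinf(p):
        # Spectral-norm ball (dual exponent q = 1): all singular values become 1.
        t = np.ones_like(s)
    else:
        q = p / (p - 1.0)  # Hölder conjugate: 1/p + 1/q = 1
        # Rescale singular values so the update has unit Schatten-p norm.
        t = s ** (q - 1.0) / np.linalg.norm(s, ord=q) ** (q - 1.0)
    # u * t scales the columns of u by t, i.e. u @ diag(t).
    return -(u * t) @ vt

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
sgd_like = schatten_lmo(G, 2.0)       # equals -G / ||G||_F
muon_like = schatten_lmo(G, np.inf)   # equals -U @ V^T
between = schatten_lmo(G, 4.0)        # an intermediate geometry
```

Intermediate values of p flatten the singular-value spectrum of the update progressively, which is one way to picture a per-layer geometry being chosen between the two extremes.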
