From SGD to Muon: Adaptive Optimization via Schatten-p Norms

Thomas Massena (IRIT; DTIPG - SNCF; UT3); Corentin Friedrich; Mathieu Serrurier (IRIT)

arXiv:2605.19781·cs.AI·May 20, 2026

TL;DR

This paper introduces an adaptive optimizer for deep neural networks that selects each layer's update geometry from gradient and activation statistics, matching or outperforming fixed-geometry optimizers such as Muon and AdamW.

Contribution

The authors propose a data-driven criterion for dynamically choosing Linear Minimization Oracle (LMO) geometries per layer. The resulting framework recovers SGD, Muon, Adam, and MuAdam as special cases and improves performance at roughly 3% additional runtime.

Findings

01

The adaptive optimizer outperforms or matches the best fixed-geometry optimizers across multiple scenarios.

02

The method adds only about 3% runtime overhead.

03

The work demonstrates the feasibility of data-driven geometry adaptation in optimization.

Abstract

Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by design or empirically, which are not necessarily optimal for the problem's geometry. We introduce a novel, efficient, data-driven criterion for dynamically choosing proxy-optimal update LMO geometries on individual deep neural network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable,…
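
The interpolation from SGD to Muon mentioned in the abstract can be illustrated with the standard closed form for an LMO over a unit Schatten-p ball. The sketch below is not the paper's implementation, and the paper's exact parameterization and selection criterion are not shown in this excerpt. It assumes a gradient with SVD G = U diag(s) V^T and only computes how the singular values s are reshaped; the full update would be -U diag(w) V^T. The function name `schatten_lmo_spectrum` is hypothetical.

```python
import math


def schatten_lmo_spectrum(singular_values, p):
    """Singular values w of the LMO direction on the unit Schatten-p ball.

    Illustrative sketch (hypothetical helper, not the paper's code).
    For G = U diag(s) V^T the full update would be -U diag(w) V^T.
    Assumes p > 1; p may be math.inf.
    """
    s = [float(x) for x in singular_values]
    if p <= 1:
        raise ValueError("this sketch assumes p > 1")
    if any(x < 0 for x in s):
        raise ValueError("singular values must be non-negative")
    top = max(s, default=0.0)
    if top == 0.0:
        return [0.0] * len(s)  # zero gradient: no preferred direction
    if math.isinf(p):
        # Spectral-norm ball: every nonzero singular direction gets weight 1,
        # i.e. an orthogonalized, Muon-like update.
        return [1.0 if x > 0 else 0.0 for x in s]
    q = p / (p - 1.0)  # dual exponent, 1/p + 1/q = 1
    r = [x / top for x in s]  # rescale; the result is scale-invariant
    dual_norm = sum(x ** q for x in r) ** (1.0 / q)
    # w_i = (s_i / ||s||_q)^(q - 1), which has unit Schatten-p norm
    return [(x / dual_norm) ** (q - 1.0) for x in r]


if __name__ == "__main__":
    for p in (2, 4, 16, math.inf):
        print(p, schatten_lmo_spectrum([3.0, 4.0], p))
```

With p = 2 the spectrum is simply rescaled to unit Frobenius norm (approximately 0.6 and 0.8 for inputs 3 and 4), which corresponds to a normalized SGD step. As p grows the weights flatten toward 1, and at p = infinity all nonzero singular values are set to 1, the orthogonalized update associated with Muon.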
