Muown: Row-Norm Control for Muon Optimization

Kai Lion; Florian Hübler; Bingcong Li; Antonio Orvieto; Niao He

arXiv:2605.10797·cs.LG·May 12, 2026

TL;DR

Muown is a novel optimizer for language model pre-training that improves upon Muon by explicitly controlling row-norms, leading to better perplexity and training stability across various model sizes.

Contribution

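We introduce Muown, an optimizer that treats the row-norms of each weight matrix as an explicit optimizer variable while applying Muon unchanged to the remaining direction component. Muown attains the optimal non-convex convergence rates and reduces spectral norm drift during training.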

Findings

01. Muown outperforms Muon, SOAP, AdamW, and Lion in perplexity across multiple model sizes.

02. Muown widens the plateau of near-optimal learning rates, making training more robust to the choice of learning rate.

03. Muown reduces sensitivity to weight decay and avoids spectral norm drift with minimal overhead.

Abstract

Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the $\ell_\infty$ geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic…
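The decomposition mentioned in the abstract can be illustrated with a small NumPy sketch. This is not the authors' code, and the exact factors used in the paper may differ. It assumes the row-magnitude factor is the largest row norm (the $\ell_\infty$ norm of the row-norm vector `r`) and the row-coherence factor is the spectral norm of the row-normalized matrix `D`. The helper `row_decompose` and all variable names are hypothetical. Under these assumptions, $W = \mathrm{diag}(r)\,D$ and submultiplicativity gives $\|W\|_2 \le \|r\|_\infty \, \|D\|_2$.

```python
import numpy as np


def row_decompose(W, eps=1e-12):
    """Split W into row norms r and unit-row directions D, so W = diag(r) @ D.

    Hypothetical helper for illustration; eps guards against all-zero rows.
    """
    r = np.linalg.norm(W, axis=1)            # one Euclidean norm per row
    D = W / np.maximum(r, eps)[:, None]      # rescale each row to unit length
    return r, D


rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))             # toy weight matrix (assumed shape)
r, D = row_decompose(W)

spectral = np.linalg.norm(W, 2)              # largest singular value of W
magnitude = r.max()                          # assumed row-magnitude factor
coherence = np.linalg.norm(D, 2)             # assumed row-coherence factor

print(f"spectral norm           : {spectral:.4f}")
print(f"magnitude * coherence   : {magnitude * coherence:.4f}")
print(f"coherence range [1, {np.sqrt(W.shape[0]):.3f}]: {coherence:.4f}")
```

Because every row of `D` has unit length, the assumed coherence factor always lies between 1 and the square root of the number of rows, so any unbounded growth of the bound on the spectral norm has to come from the row magnitudes. This matches the abstract's diagnosis that the row-magnitude factor drives the drift.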
