Muown: Row-Norm Control for Muon Optimization
Kai Lion, Florian Hübler, Bingcong Li, Antonio Orvieto, Niao He

TL;DR
Muown is an optimizer for language model pre-training that improves upon Muon by explicitly controlling row norms, leading to better perplexity and training stability across multiple model sizes.
Contribution
We introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector of each weight matrix as an explicit optimizer variable. Muown provably attains optimal non-convex convergence rates and reduces spectral norm drift during training.
Findings
Muown outperforms Muon, SOAP, AdamW, and Lion in perplexity across multiple model sizes.
Muown widens the plateau of near-optimal learning rates, enhancing training robustness.
Muown reduces sensitivity to weight decay and avoids spectral norm drift with minimal overhead.
Abstract
Muon has emerged as a strong competitor to AdamW for language model pre-training, yet its behavior at scale is sensitive to weight decay. Recent work has observed that, for Muon without decoupled weight decay, the spectral norm of weight matrices drifts upward over training. Through a decomposition of the spectral norm into a row-magnitude factor and a row-coherence factor, we identify the former as the empirical driver of this drift under Muon, while the latter remains well-behaved along the trajectory. Motivated by this diagnosis, we introduce Muown, a drop-in replacement for Muon that treats the row-magnitude vector as an explicit optimizer variable, updating it under the geometry induced by the decomposition, while applying Muon unchanged to the remaining direction component. We prove that Muown attains the optimal non-convex rates in both deterministic and stochastic…
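The abstract does not spell out the decomposition, so the following is only an illustrative sketch under one plausible reading: write a weight matrix as W = diag(r) D, where r holds the row norms (the row-magnitude factor) and D has unit-norm rows (its spectral norm standing in for row coherence). Submultiplicativity then gives ||W||_2 <= max_i r_i * ||D||_2. The helper names `row_decompose` and `spectral_norm` are hypothetical and not taken from the paper; the spectral norm is computed with plain power iteration to keep the example dependency-free.

```python
import math

def row_decompose(W):
    """Split W into row norms r and a unit-row matrix D with W[i] = r[i] * D[i].

    Assumes W has no all-zero rows.
    """
    r = [math.sqrt(sum(x * x for x in row)) for row in W]
    D = [[x / ri for x in row] for row, ri in zip(W, r)]
    return r, D

def spectral_norm(A, iters=1000):
    """Largest singular value of A, via power iteration on A^T A."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n  # fixed start vector; assumed not orthogonal to the top direction
    for _ in range(iters):
        Av = [sum(a * x for a, x in zip(row, v)) for row in A]
        u = [sum(A[i][j] * Av[i] for i in range(m)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        v = [x / norm for x in u]
    Av = [sum(a * x for a, x in zip(row, v)) for row in A]
    return math.sqrt(sum(x * x for x in Av))

# Small illustrative matrix (made up for this sketch, not data from the paper).
W = [[3.0, 4.0],
     [1.0, 0.0],
     [0.0, 2.0]]

r, D = row_decompose(W)
norm_W = spectral_norm(W)
norm_D = spectral_norm(D)

# Row-magnitude factor times row-coherence factor upper-bounds the spectral norm.
bound = max(r) * norm_D
print("row norms:", r)
print("||W||_2 =", norm_W, " max(r) * ||D||_2 =", bound)
```

Under this reading, ||D||_2 lies between 1 and the square root of the number of rows whenever the rows have unit norm, so sustained growth in ||W||_2 has to show up in the row norms r, which is consistent with the abstract's diagnosis that row magnitude drives the drift.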
