Phases of Muon: When Muon Eclipses SignSGD
Elliot Paquette, Noah Marshall, Lucas Benigni, Guangyuan Wang, Atish Agarwala, Courtney Paquette

TL;DR
This paper analyzes the spectral optimizer Muon (through SignSVD, which Muon approximates) alongside SignSGD (a proxy for Adam) on high-dimensional matrix-valued least squares problems, showing that the two converge differently and that which one is favored depends on the phase.
Contribution
It derives explicit deterministic dynamics for stochastic SignSVD (which Muon approximates) and stochastic SignSGD, clarifying their preconditioning effects and the phase transitions governed by the data covariance structure.
Findings
At large batch sizes, SignSVD performs square-root preconditioning with respect to the data covariance spectrum.
At small batch sizes, the smaller eigenmodes under SignSVD behave like SGD, which slows convergence.
Three phases exist in data covariance space where either SignSGD or SignSVD is favored.
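To make the two update rules in the findings above concrete, here is a minimal NumPy sketch of one minibatch step direction for each. It is an illustration under stated assumptions, not the paper's experimental setup: the loss form, the dimensions `d`, `m`, `B`, the power-law feature scales, and all function names are hypothetical. The only facts relied on are the standard definitions: SignSGD takes the entrywise sign of the gradient, and SignSVD replaces every singular value of the gradient with 1 (keeping the singular vectors).

```python
import numpy as np

def minibatch_gradient(W, W_star, X):
    """Gradient of (0.5 / B) * ||X W - X W_star||_F^2 with respect to W.

    Assumed loss form for illustration; the paper's exact problem may differ.
    """
    B = X.shape[0]
    return X.T @ (X @ (W - W_star)) / B

def signsgd_direction(G):
    """SignSGD: entrywise sign of the gradient matrix."""
    return np.sign(G)

def signsvd_direction(G):
    """SignSVD: keep the singular vectors of G, set every singular value to 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)
d, m, B = 8, 4, 64  # hypothetical sizes: input dim, output dim, batch size

# Hypothetical anisotropic data: feature j has standard deviation j^(-1/2).
scales = np.arange(1, d + 1, dtype=float) ** -0.5
X = rng.standard_normal((B, d)) * scales

W_star = rng.standard_normal((d, m))  # target weights
W = np.zeros((d, m))                  # current iterate

G = minibatch_gradient(W, W_star, X)
D_sign = signsgd_direction(G)  # every entry is +1 or -1
D_svd = signsvd_direction(G)   # every singular value equals 1

lr = 0.01  # hypothetical step size
W_after_signsgd = W - lr * D_sign
W_after_signsvd = W - lr * D_svd
```

Both directions discard the magnitude of the gradient, but in different bases: SignSGD normalizes each matrix entry, while SignSVD normalizes each singular direction. Muon, as described in the abstract, approximates the SignSVD direction rather than computing a full SVD.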
Abstract
Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to…
