Muon is Not That Special: Random or Inverted Spectra Work Just as Well
Zakhar Shumaylov, Nathaël Da Costa, Peter Zaika, Bálint Mucsányi, Alex Massucco, Yoav Gelberg, Carola-Bibiane Schönlieb, Yarin Gal, Philipp Hennig

TL;DR
This paper challenges the view that precise geometric structure explains the success of optimizers such as Muon. It shows that comparable performance can be reached with random or inverted spectra, and that performance is instead governed by local quantities such as alignment and descent potential.
Contribution
It introduces the Freon and Kaon optimizers, uses them to demonstrate that precise geometric structure is not essential for optimization success, and highlights the role of local quantities in step-size tuning.
Findings
Freon interpolates between SGD and Muon and extrapolates into the quasi-norm regime, where the best-performing Schatten parameters for GPT-2 lie.
Kaon, with random noise, matches Muon's performance despite lacking geometric structure.
Performance is governed by alignment and descent potential, not strict geometric properties.
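The interpolation between SGD and Muon can be pictured with a minimal sketch, assuming an exact SVD in place of the paper's QDWH-based iteration. The function names, the exponent `alpha`, and the cosine definition of alignment used here are illustrative assumptions, not the paper's definitions. For a gradient G = U diag(s) V^T, the sketch returns U diag(s^alpha) V^T: alpha = 1 recovers the raw gradient (an SGD-like direction), alpha = 0 gives the orthogonalized direction U V^T (Muon-like), and a negative alpha inverts the spectrum.

```python
import numpy as np

def spectral_power_direction(grad, alpha, eps=1e-12):
    """Return U diag(s**alpha) V^T for grad = U diag(s) V^T.

    `alpha` is a hypothetical exponent used for illustration only;
    it is not the paper's Schatten parameter.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    # Clamp tiny singular values so negative exponents stay finite.
    s_new = np.maximum(s, eps) ** alpha
    return (u * s_new) @ vt

def alignment(grad, update):
    """Cosine between gradient and update under the Frobenius inner product
    (an assumed, illustrative definition of alignment)."""
    inner = np.sum(grad * update)
    return float(inner / (np.linalg.norm(grad) * np.linalg.norm(update)))

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))

sgd_like = spectral_power_direction(G, 1.0)    # equals G up to rounding
muon_like = spectral_power_direction(G, 0.0)   # U V^T, orthonormal columns
inverted = spectral_power_direction(G, -0.5)   # small singular values amplified

for name, upd in [("sgd", sgd_like), ("muon", muon_like), ("inverted", inverted)]:
    print(name, round(alignment(G, upd), 3))
```

In this sketch every choice of alpha keeps a positive inner product with the gradient, since the inner product equals the sum of s^(1 + alpha) over the singular values. This illustrates how directions with very different spectra can all remain descent directions.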
Abstract
The recent empirical success of the Muon optimizer has renewed interest in non-Euclidean optimization, typically justified by similarities with second-order methods and by linear minimization oracle (LMO) theory. In this paper, we challenge this geometric narrative through three contributions, demonstrating that precise geometric structure is not the key factor affecting optimization performance. First, we introduce Freon, a family of optimizers based on Schatten (quasi-)norms, powered by a novel, provably optimal QDWH-based iterative approximation. Freon naturally interpolates between SGD and Muon, while smoothly extrapolating into the quasi-norm regime. Empirically, the best-performing Schatten parameters for GPT-2 lie strictly within the quasi-norm regime, and thus cannot be represented by any unitarily invariant LMO. Second, noting that Freon performs well across a wide range of…
