Insights on Muon from Simple Quadratics
Antoine Gonon, Andreea-Alexandra Mușat, Nicolas Boumal

TL;DR
This paper examines why the Muon optimizer performs well in practice by analyzing its behavior on simple quadratic functions, revealing effects that single-step (local) comparisons and worst-case analyses do not capture.
Contribution
It shows how errors in approximating the polar factor, together with structural properties of the objective, influence Muon's performance, challenging existing theoretical explanations.
Findings
Polar step approximation errors can improve finite-time performance.
Structural properties of objectives affect optimization constants.
Existing theories overlook these effects, requiring new explanations.
Abstract
Muon updates weight matrices along (approximate) polar factors of the gradients and has shown strong empirical performance in large-scale training. Existing attempts at explaining its performance largely focus on single-step comparisons (on quadratic proxies) and worst-case guarantees that treat the inexactness of the polar factor as a nuisance "to be argued away". We show that already on simple strongly convex functions such as , these perspectives are insufficient, suggesting that understanding Muon requires going beyond local proxies and pessimistic worst-case bounds. Instead, our analysis exposes two observations that already affect behavior on simple quadratics and are not well captured by prevailing abstractions: (i) approximation error in the polar step can qualitatively alter discrete-time dynamics and improve reachability and finite-time…
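To make the "approximate polar factor" in the abstract concrete, here is a minimal sketch in plain Python (nested lists, no dependencies). It uses the cubic Newton-Schulz iteration as a stand-in approximation scheme; the abstract does not say which scheme the paper analyzes, so this choice is an assumption. The helper names (`newton_schulz_polar`, `orthogonality_error`, `polar_step`), the example matrix `G`, and the step size are all hypothetical and chosen only for illustration.

```python
import math


def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frobenius_norm(a):
    return math.sqrt(sum(x * x for row in a for x in row))


def newton_schulz_polar(g, steps):
    """Approximate the polar factor of a full-rank matrix g.

    Uses the cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X
    (an assumed stand-in, not necessarily the scheme studied in the paper).
    """
    norm = frobenius_norm(g)
    # Scaling by the Frobenius norm puts all singular values in (0, 1].
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
             for i in range(len(x))]
    return x


def orthogonality_error(x):
    """Frobenius norm of X^T X - I; zero when X has orthonormal columns."""
    xtx = matmul(transpose(x), x)
    n = len(xtx)
    return frobenius_norm([[xtx[i][j] - (1.0 if i == j else 0.0)
                            for j in range(n)] for i in range(n)])


def polar_step(w, g, lr, steps):
    """One update W <- W - lr * (approximate polar factor of the gradient g)."""
    p = newton_schulz_polar(g, steps)
    return [[w[i][j] - lr * p[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


# Hypothetical gradient: symmetric positive definite, so its exact polar
# factor is the identity matrix.
G = [[3.0, 1.0], [1.0, 2.0]]

rough = newton_schulz_polar(G, steps=2)      # few iterations: inexact polar factor
accurate = newton_schulz_polar(G, steps=30)  # many iterations: close to the identity

print("orthogonality error, 2 steps: ", orthogonality_error(rough))
print("orthogonality error, 30 steps:", orthogonality_error(accurate))
```

With few iterations the small singular values of the gradient have not yet been pushed to 1, so the update direction is not orthogonal. This gap between the rough and the accurate direction is the kind of approximation error that observation (i) in the abstract refers to.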
