Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Sayantan Choudhury, Xiaoran Cheng, Martin Takáč, Sen Na, Mladen Kolar

TL;DR
This paper develops a convergence theory for the Muon optimizer with Nesterov momentum and inexact polar decomposition, covering two features of practical implementations: heavy-tailed stochastic gradient noise and approximate computation of the polar factor.
Contribution
It introduces a unified framework for inexact polar decomposition, providing optimal complexity bounds and practical algorithms for non-convex matrix optimization.
Findings
Established convergence guarantees for Muon with Nesterov momentum in non-convex matrix optimization under heavy-tailed noise.
Proposed a randomized low-rank polar decomposition method.
Numerical experiments confirm the effectiveness of the proposed methods.
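To give a rough idea of what a randomized low-rank polar decomposition can look like, here is a minimal NumPy sketch based on a standard randomized range finder. It is not taken from the paper and may differ from the authors' algorithm. The function name `randomized_lowrank_polar` and the parameters `rank`, `oversample`, and `seed` are hypothetical, chosen only for illustration.

```python
import numpy as np

def randomized_lowrank_polar(A, rank, oversample=5, seed=0):
    """Approximate the rank-`rank` polar factor U_r V_r^T of A.

    Illustrative sketch only (assumed construction, not the paper's method):
    project A onto a random low-dimensional subspace, then take the polar
    factor of the small projected matrix.
    """
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + oversample, m, n)       # sketch size with a little slack
    Omega = rng.standard_normal((n, k))    # Gaussian test matrix
    Q, _ = np.linalg.qr(A @ Omega)         # orthonormal basis for the sketched range
    B = Q.T @ A                            # small k x n projected matrix
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    # Keep the leading `rank` singular directions and lift back to m x n.
    return Q @ U[:, :rank] @ Vt[:rank]
```

When `A` has exact rank `rank`, the sketch recovers the polar factor restricted to its nonzero singular directions. For general matrices the result is only an approximation whose quality depends on the spectrum of `A` and on `oversample`.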
Abstract
Most first-order optimizers treat matrix-valued parameters as vectors, ignoring the intrinsic geometry of hidden-layer weights in neural networks. Muon addresses this mismatch by updating along the polar factor of a momentum matrix, but its theoretical understanding has lagged behind practice. In particular, practical implementations incorporate Nesterov momentum, compute the polar factor only approximately, and operate with stochastic gradients that may be heavy-tailed. We close this gap by developing a convergence theory for Muon with Nesterov momentum and inexact polar decomposition in non-convex matrix optimization under heavy-tailed noise. Our analysis builds on a unified framework for inexact polar decomposition that captures practical iterative approximations such as Newton-Schulz and quantifies how their errors propagate through the optimization dynamics. Under this framework,…
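The update described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm. It uses the classical cubic Newton-Schulz iteration (practical Muon implementations often use tuned polynomial coefficients instead) and one common form of Nesterov momentum. The function names, the step size `lr`, the momentum parameter `beta`, and the iteration count `steps` are all hypothetical.

```python
import numpy as np

def newton_schulz_polar(A, steps=15):
    """Approximate the polar factor U V^T of A with cubic Newton-Schulz.

    Dividing by the Frobenius norm puts all singular values in (0, 1],
    inside the convergence region of the iteration. Fewer steps give a
    cheaper but less exact polar factor.
    """
    X = A / (np.linalg.norm(A) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X   # pushes each singular value toward 1
    return X

def muon_nesterov_step(W, M, G, lr=0.02, beta=0.9, steps=15):
    """One Muon-style step with Nesterov momentum (assumed form).

    W: weight matrix, M: momentum buffer, G: stochastic gradient.
    """
    M = beta * M + (1.0 - beta) * G       # momentum buffer update
    N = beta * M + (1.0 - beta) * G       # Nesterov look-ahead direction
    W = W - lr * newton_schulz_polar(N, steps)
    return W, M
```

Reducing `steps` makes the polar factor inexact, which is the kind of approximation error the paper's framework is meant to quantify.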
