Convergence of Muon with Newton-Schulz
Gyu Yeol Kim, Min-hwan Oh

TL;DR
This paper provides a theoretical analysis of Muon as it is used in practice, showing that orthogonalizing the momentum with a few Newton-Schulz steps converges at the same rate as the exact SVD-based polar factor, up to a constant factor, while running faster in practice.
Contribution
It proves that Muon with Newton-Schulz converges to a stationary point at the same rate as Muon with the exact SVD-based polar factor, up to a constant factor, and explains why a few low-degree Newton-Schulz steps suffice for orthogonalization in practice.
Findings
The convergence rate of Muon with Newton-Schulz matches that of Muon with SVD-based polar factorization, up to a constant factor.
This constant factor converges to 1 doubly exponentially in the number of Newton-Schulz steps and improves with the degree of the Newton-Schulz polynomial.
Muon removes the typical square-root-of-rank loss incurred by its vector-based counterpart, SGD with momentum.
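The doubly exponential behavior in the second finding can be illustrated on a single singular value. The sketch below is not the paper's constant factor; it assumes the classic cubic Newton-Schulz polynomial s -> 1.5 s - 0.5 s^3, which is what that iteration does to each singular value of a normalized matrix. The function names `ns_scalar_step` and `ns_errors` are hypothetical. With e = 1 - s, one step gives e_next = e^2 (1.5 - 0.5 e), so the error is roughly squared at every step.

```python
def ns_scalar_step(s):
    """One cubic Newton-Schulz step applied to a single singular value s."""
    return 1.5 * s - 0.5 * s ** 3


def ns_errors(s0, steps):
    """Return |1 - s_k| for k = 0..steps, starting from singular value s0."""
    errors = [abs(1.0 - s0)]
    s = s0
    for _ in range(steps):
        s = ns_scalar_step(s)
        errors.append(abs(1.0 - s))
    return errors


if __name__ == "__main__":
    # Assumed starting value 0.5; any value in (0, 1] behaves similarly.
    for k, err in enumerate(ns_errors(0.5, 6)):
        print(f"step {k}: |1 - s| = {err:.3e}")
```

Because each step squares the error up to a bounded factor, the number of correct digits roughly doubles per step once the error is small, which is the scalar mechanism behind a constant that approaches 1 doubly exponentially.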
Abstract
We analyze Muon as originally proposed and used in practice -- orthogonalizing the momentum with a few Newton-Schulz steps. The prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor for a given number of Newton-Schulz steps. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in the number of Newton-Schulz steps and improves with the degree of the polynomial used in Newton-Schulz for approximating the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at a much faster…
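The update analyzed in the abstract can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions, not the authors' implementation: it uses the classic cubic Newton-Schulz iteration X <- 1.5 X - 0.5 (X X^T) X after Frobenius normalization, whereas practical implementations may use a tuned higher-degree polynomial; it uses plain heavy-ball momentum; and the names `newton_schulz`, `muon_step`, `lr`, `beta`, and `ns_steps` are hypothetical. The SVD-polar idealization would replace `newton_schulz` with the exact polar factor U V^T.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def newton_schulz(G, steps=5):
    """Approximate the polar factor U V^T of G with the cubic iteration."""
    # Frobenius normalization puts every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    fro = math.sqrt(sum(x * x for row in G for x in row))
    X = [[x / (fro + 1e-12) for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


def muon_step(W, M, grad, lr=0.02, beta=0.9, ns_steps=5):
    """One Muon-style update: momentum, orthogonalize, then descend."""
    M = [[beta * m + g for m, g in zip(rm, rg)] for rm, rg in zip(M, grad)]
    O = newton_schulz(M, ns_steps)
    W = [[w - lr * o for w, o in zip(rw, ro)] for rw, ro in zip(W, O)]
    return W, M


if __name__ == "__main__":
    grad = [[2.0, 0.0, 1.0], [0.0, 3.0, 1.0]]  # assumed toy gradient
    zeros = [[0.0] * 3 for _ in range(2)]
    W, M = muon_step(zeros, zeros, grad)
    print(W)
```

For a full-row-rank input, the rows of the output of `newton_schulz` become approximately orthonormal after a handful of steps, so the update direction has all singular values close to 1 regardless of the scale of the gradient.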
Peer Reviews
Decision·ICLR 2026 Poster
This is one of the first convergence analysis results for Muon with a finite number of Newton-Schulz steps, which is a big step forward in narrowing the gap between theory and practice.
The experimental part is too simple. It would be good to see experiments with larger and more varied datasets, models with more parameters, various model types such as MLPs and transformers, more epochs, and a more specific analysis of how the results validate the theoretical analysis.
The paper addresses a very demanding question about the convergence speed of the recently proposed MUON algorithm with a finite step of NEWTON–SCHULZ. As an approximate version of the MUON with the exact SVD orthogonalization of the momentum update, the presented analysis provides a solid theoretical understanding of the convergence behavior of the MUON in connection to its exact SVD counterpart, and ultimately guarantees the properties of deploying the MUON algorithm with a finite step of NEWTO
The core message of the paper is that the MUON algorithm with a finite number of NEWTON–SCHULZ steps converges to a stationary point at a rate similar to that of its exact SVD counterpart. However, one crucial aspect of training DNNs is the quality of stationary points. In other words, the discrepancy between the stationary points reached by MUON with finite NEWTON–SCHULZ and by MUON with the exact SVD can be large, and is therefore worth investigating.
This paper is very clear and generally very well written. The numerical results are extensive, but primarily in the appendix. I suggest moving several of the key plots from the appendix to the main text.
I did find many redundant statements of the main results throughout the paper (often with exact wording…); while I generally agree that repetition like this can enhance clarity of interpretation and impact, I found this manuscript to be excessive in that regard. Also, from what I saw, there is little additional information provided by the presentation
Taxonomy
Topics: Muon and positron interactions and applications · Particle physics theoretical and experimental studies · Computational Physics and Python Applications
