Convergence of Muon with Newton-Schulz

Gyu Yeol Kim; Min-hwan Oh

arXiv:2601.19156·stat.ML·January 28, 2026

Open Access 3 Reviews

TL;DR

This paper analyzes Muon as it is used in practice and shows that orthogonalizing the momentum with a few Newton-Schulz steps converges at the same rate as the exact SVD-based polar factor, up to a constant factor that approaches 1 as the number of steps grows, while being faster in practice.

Contribution

It proves that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-based polar idealization, up to a constant factor, and explains why a few low-degree Newton-Schulz steps suffice in practice.

Findings

01

Convergence rate of Muon with Newton-Schulz matches SVD-based polar factorization.

02

The constant factor in the convergence bound approaches 1 doubly exponentially in the number of Newton-Schulz steps.

03

Muon removes the typical square-root-of-rank loss compared to vector-based optimizers.
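Finding 02 can be illustrated on a single singular value. The sketch below assumes the classical cubic Newton-Schulz map $p(s) = 1.5s - 0.5s^3$, which is an illustrative choice; the polynomial used in the paper and its exact constant are not given in this summary. For this map the gap $e = 1 - s$ satisfies $e_{k+1} = 1.5e_k^2 - 0.5e_k^3 \le 1.5e_k^2$, so the gap is roughly squared at every step. The function names are hypothetical.

```python
def newton_schulz_scalar(s: float) -> float:
    """One cubic Newton-Schulz step on a single singular value s in (0, 1].

    Assumption: the classical map p(s) = 1.5*s - 0.5*s**3, not necessarily
    the polynomial analyzed in the paper.
    """
    return 1.5 * s - 0.5 * s ** 3


def gaps_after_steps(s0: float, q: int) -> list[float]:
    """Return the gaps 1 - s_k for k = 0..q, starting from singular value s0."""
    s = s0
    gaps = [1.0 - s0]
    for _ in range(q):
        s = newton_schulz_scalar(s)
        gaps.append(1.0 - s)
    return gaps


if __name__ == "__main__":
    # Each gap is at most 1.5 times the square of the previous one.
    for k, gap in enumerate(gaps_after_steps(0.5, 6)):
        print(f"step {k}: 1 - s = {gap:.3e}")
```

Because each step roughly squares the gap, a handful of steps is enough to push every singular value close to 1, which is consistent with the doubly exponential behavior stated above.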

Abstract

We analyze Muon as originally proposed and used in practice -- using the momentum orthogonalization with a few Newton-Schulz steps. The prior theoretical results replace this key step in Muon with an exact SVD-based polar factor. We prove that Muon with Newton-Schulz converges to a stationary point at the same rate as the SVD-polar idealization, up to a constant factor for a given number $q$ of Newton-Schulz steps. We further analyze this constant factor and prove that it converges to 1 doubly exponentially in $q$ and improves with the degree of the polynomial used in Newton-Schulz for approximating the orthogonalization direction. We also prove that Muon removes the typical square-root-of-rank loss compared to its vector-based counterpart, SGD with momentum. Our results explain why Muon with a few low-degree Newton-Schulz steps matches exact-polar (SVD) behavior at a much faster…
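The comparison described in the abstract can be sketched numerically. The example below is a minimal illustration, not the authors' implementation: it assumes the classical cubic Newton-Schulz iteration $X \leftarrow 1.5X - 0.5XX^\top X$ with Frobenius scaling, a plain heavy-ball momentum buffer, and placeholder hyperparameters (`lr`, `beta`, `q`). All function names are hypothetical.

```python
import numpy as np


def polar_factor_svd(g: np.ndarray) -> np.ndarray:
    """Exact polar factor U V^T computed from a thin SVD of g."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def newton_schulz(g: np.ndarray, q: int) -> np.ndarray:
    """Approximate the polar factor of g with q cubic Newton-Schulz steps.

    Assumption: classical cubic iteration; practical Muon variants may use
    a different (for example higher-degree) polynomial.
    """
    # Frobenius scaling puts every singular value in (0, 1].
    x = g / (np.linalg.norm(g) + 1e-12)
    for _ in range(q):
        x = 1.5 * x - 0.5 * x @ x.T @ x
    return x


def muon_step(w, grad, momentum, lr=0.02, beta=0.95, q=5):
    """One Muon-style update: accumulate momentum, orthogonalize, then step.

    Hyperparameter defaults are placeholders chosen for illustration.
    """
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz(momentum, q)
    return w, momentum


def make_test_matrix(seed: int = 0) -> np.ndarray:
    """Build an 8x4 matrix with known singular values 1.0, 0.7, 0.4, 0.2."""
    rng = np.random.default_rng(seed)
    u, _ = np.linalg.qr(rng.standard_normal((8, 4)))
    v, _ = np.linalg.qr(rng.standard_normal((4, 4)))
    return u @ np.diag([1.0, 0.7, 0.4, 0.2]) @ v.T


if __name__ == "__main__":
    g = make_test_matrix()
    exact = polar_factor_svd(g)
    # The Frobenius distance to the exact polar factor shrinks as q grows.
    for q in (2, 5, 8, 12):
        err = np.linalg.norm(newton_schulz(g, q) - exact)
        print(f"q = {q:2d}: distance to SVD polar factor = {err:.3e}")
```

The iteration uses only matrix multiplications, which is why it is attractive on accelerators compared with a full SVD; how closely a small `q` tracks the exact polar factor depends on the smallest scaled singular value of the input.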

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 2

Strengths

This is one of the first convergence analysis results for Muon with a finite number of Newton-Schulz steps, which is a big step toward narrowing the gap between theory and practice.

Weaknesses

The experimental part is too simple. It would benefit from larger and more varied datasets, models with more parameters, a wider range of model types such as MLPs and transformers, more epochs, and a more specific analysis of how the results validate the theory.

Reviewer 02 · Rating 8 · Confidence 3

Strengths

The paper addresses a very demanding question about the convergence speed of the recently proposed MUON algorithm with a finite step of NEWTON–SCHULZ. As an approximate version of the MUON with the exact SVD orthogonalization of the momentum update, the presented analysis provides a solid theoretical understanding of the convergence behavior of the MUON in connection to its exact SVD counterpart, and ultimately guarantees the properties of deploying the MUON algorithm with a finite step of NEWTO

Weaknesses

The core message of the paper is that the MUON algorithm with a finite number of NEWTON–SCHULZ steps converges to a stationary point at a rate similar to that of its exact SVD counterpart. However, one crucial aspect of training DNNs is the quality of the stationary points. The discrepancy between the stationary points reached by MUON with finite NEWTON–SCHULZ steps and by MUON with the exact SVD can be large, and is therefore worth investigating.

Reviewer 03 · Rating 4 · Confidence 4

Strengths

This paper is very clear and generally very well written. The numerical results are extensive, but primarily in the appendix. I suggest moving several of the key plots from the appendix to the main text.

Weaknesses

I did find many redundant statements of the main results throughout the paper (often with exact wording…); while I generally agree that repetition like this can enhance clarity of interpretation and impact, I found this manuscript to be excessive in that regard. The numerical results are extensive, but primarily in the appendix. I suggest moving several of the key plots from the appendix to the main text. Also, from what I saw, there is little additional information provided by the presentation

Taxonomy

Topics: Muon and positron interactions and applications · Particle physics theoretical and experimental studies · Computational Physics and Python Applications