On the Convergence Analysis of Muon
Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, Jiawei Zhang

TL;DR
This paper gives a theoretical convergence analysis of Muon, an optimizer designed for matrix-structured neural network parameters, and uses it to explain when Muon outperforms Gradient Descent (GD).
Contribution
It offers the first comprehensive convergence rate analysis of Muon and identifies the conditions under which it outperforms Gradient Descent (GD), in particular when the Hessian has low-rank structure.
Findings
Muon can outperform Gradient Descent (GD) under conditions that the paper characterizes explicitly.
Theoretical results show Muon benefits from low-rank Hessian structures.
Experimental results support the theoretical convergence analysis.
Abstract
The majority of parameters in neural networks are naturally represented as matrices. However, most commonly used optimizers treat these matrix parameters as flattened vectors during optimization, potentially overlooking their inherent structural properties. Recently, an optimizer called Muon has been proposed, specifically designed to optimize matrix-structured parameters. Extensive empirical evidence shows that Muon can significantly outperform traditional optimizers when training neural networks. Nonetheless, the theoretical understanding of Muon's convergence behavior and the reasons behind its superior performance remain limited. In this work, we present a comprehensive convergence rate analysis of Muon and its comparison with Gradient Descent (GD). We characterize the conditions under which Muon can outperform GD. Our theoretical results reveal that Muon can benefit from the…
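To make the contrast between a matrix-aware update and a flattened-vector update concrete, here is a minimal sketch of one Muon-style step next to a plain GD step. It is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the update rule, so the sketch follows the commonly described form of Muon (momentum followed by orthogonalization of the momentum matrix). It uses an exact SVD for the orthogonalization, where practical implementations typically use a Newton-Schulz iteration. The function names, `lr`, and `beta` are hypothetical.

```python
import numpy as np


def orthogonalize(m):
    """Return U @ V^T from the thin SVD of m.

    This replaces every nonzero singular value of m with 1, keeping only
    the singular directions. (Assumption: exact SVD stands in for the
    Newton-Schulz iteration used in practical Muon implementations.)
    """
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def muon_step(w, grad, momentum, lr=0.1, beta=0.9):
    """One Muon-style step on a matrix parameter w (illustrative sketch).

    Assumed update: accumulate momentum, then move along the
    orthogonalized momentum matrix rather than the raw gradient.
    """
    momentum = beta * momentum + grad
    w = w - lr * orthogonalize(momentum)
    return w, momentum


def gd_step(w, grad, lr=0.1):
    """Plain gradient descent step, which ignores the matrix structure."""
    return w - lr * grad


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4, 3))
    grad = rng.standard_normal((4, 3))
    w_muon, m = muon_step(w, grad, np.zeros_like(w))
    w_gd = gd_step(w, grad)
    # Singular values of each update direction: all ones for the Muon-style
    # step, unequal for GD.
    print(np.linalg.svd(w - w_muon, compute_uv=False) / 0.1)
    print(np.linalg.svd(w - w_gd, compute_uv=False) / 0.1)
```

The Muon-style step moves equally far along every singular direction of the momentum, while the GD step is dominated by the largest singular values of the gradient.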
