Convergence Bound and Critical Batch Size of Muon Optimizer
Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka

TL;DR
This paper gives a theoretical convergence analysis of the Muon optimizer, shows that adding weight decay yields strictly tighter convergence bounds, and derives the critical batch size that minimizes the computational cost of training. Experiments across tasks support the theory.
Contribution
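It offers the first convergence proofs for Muon in four practical settings (with and without Nesterov momentum, with and without weight decay), analyzes how the weight decay coefficient interacts with the learning rate, and derives the critical batch size for efficient training.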
Findings
Weight decay leads to tighter convergence bounds.
The critical batch size for Muon is identified and validated.
Experimental results confirm theoretical predictions across tasks.
Abstract
Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. Our analysis covers the standard configuration using both, thereby elucidating its real-world performance. We then demonstrate that the addition of weight decay yields strictly tighter theoretical bounds and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive the critical batch size for Muon that minimizes the computational cost of training. Our analysis…
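For readers unfamiliar with the optimizer, the following is a minimal sketch of one Muon-style update on a single weight matrix, in the configuration the abstract calls standard (Nesterov momentum plus weight decay). It is not taken from the paper. The function names, default hyperparameters, the decoupled form of the weight decay, and the use of a plain cubic Newton-Schulz iteration for the orthogonalization step are all assumptions made for illustration; practical implementations typically use a tuned polynomial iteration and extra scaling.

```python
# Illustrative sketch only: names, defaults, and the exact update form are
# assumptions, not the paper's algorithm statement.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g using the cubic
    Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X (an assumed stand-in
    for the iteration used in practical Muon implementations)."""
    fro = sum(v * v for row in g for v in row) ** 0.5
    if fro == 0.0:
        return [row[:] for row in g]
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # which lies inside the convergence region of the iteration.
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(r1, r2)] for r1, r2 in zip(x, xxtx)]
    return x

def muon_step(w, grad, m, lr=0.02, beta=0.95, weight_decay=0.01):
    """One assumed Muon-style step; returns (new_weights, new_momentum)."""
    # Momentum buffer update.
    m_new = [[beta * mv + gv for mv, gv in zip(mr, gr)] for mr, gr in zip(m, grad)]
    # Nesterov-style look-ahead direction.
    d = [[beta * mv + gv for mv, gv in zip(mr, gr)] for mr, gr in zip(m_new, grad)]
    # Replace the direction with its (approximately) orthogonalized version.
    o = orthogonalize(d)
    # Decoupled weight decay: shrink the weights alongside the update.
    w_new = [[wv - lr * (ov + weight_decay * wv) for wv, ov in zip(wr, orow)]
             for wr, orow in zip(w, o)]
    return w_new, m_new
```

The orthogonalization step is what uses the matrix structure of the parameters: the update direction keeps the singular vectors of the momentum-averaged gradient but pushes its nonzero singular values toward 1. Note that weight_decay multiplies lr in this sketch, which is one way the two coefficients can interact.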
Taxonomy
Topics: Computational Physics and Python Applications · Muon and positron interactions and applications · Particle physics theoretical and experimental studies
