A Note on the Convergence of Muon
Jiaxiang Li, Mingyi Hong

TL;DR
This note analyzes the convergence of Muon, a new optimizer for pretraining large language models, and relates it to steepest descent under the spectral norm.
Contribution
It analyzes, rather than introduces, the Muon optimizer for LLM pretraining: it relates Muon to a specialized steepest descent method and provides a convergence analysis for both versions of the optimizer.
Findings
Convergence of the Muon optimizer is established theoretically.
The optimizer is closely related to a steepest descent method whose update direction minimizes the quadratic approximation of the objective under the spectral norm.
Implications of the convergence analysis are discussed.
Abstract
In this note, we inspect the convergence of a new optimizer for pretraining LLMs, namely the Muon optimizer. Such an optimizer is closely related to a specialized steepest descent method where the update direction is the minimizer of the quadratic approximation of the objective function under spectral norm. We provide the convergence analysis on both versions of the optimizer and discuss its implications.
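To make the steepest descent connection concrete, here is a minimal sketch of the update direction the abstract describes. It is an illustration under stated assumptions, not the paper's algorithm. For a gradient matrix G with thin SVD G = U S V^T, the direction D minimizing <G, D> + (L/2) * ||D||_op^2 is D = -(||G||_* / L) * U V^T, where ||G||_* is the nuclear norm. The function names, the constant `lipschitz` (standing in for L), and the use of an exact SVD are assumptions made for this example; momentum and the two optimizer versions analyzed in the note are not modeled.

```python
import numpy as np


def quadratic_model(grad, direction, lipschitz=1.0):
    """Value of <grad, D> + (lipschitz / 2) * ||D||_op^2 (hypothetical helper)."""
    inner = np.sum(grad * direction)        # Frobenius inner product <grad, D>
    op_norm = np.linalg.norm(direction, 2)  # spectral norm = largest singular value
    return inner + 0.5 * lipschitz * op_norm ** 2


def spectral_steepest_descent_direction(grad, lipschitz=1.0):
    """Minimizer of quadratic_model over D, computed from the thin SVD of grad."""
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    nuclear_norm = s.sum()                  # sum of singular values of grad
    # The orthogonal factor U @ Vt sets the direction; the nuclear norm sets the step size.
    return -(nuclear_norm / lipschitz) * (U @ Vt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((4, 3))         # stand-in for a weight-matrix gradient
    D = spectral_steepest_descent_direction(G, lipschitz=2.0)
    print("model value at D:", quadratic_model(G, D, lipschitz=2.0))
```

Every singular value of the resulting direction has the same magnitude, so the step moves equally along all singular directions of the gradient regardless of how large or small the corresponding singular values are.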
