Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning
Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang

TL;DR
Muon$^2$ enhances the Muon optimizer by integrating adaptive second-moment preconditioning, significantly improving convergence speed and efficiency in large-scale model pre-training.
Contribution
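The paper introduces Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning to the momentum matrix before orthogonalization. The preconditioned matrix is better conditioned, so fewer Newton--Schulz iterations are needed to reach a practically sufficient orthogonalization in large-scale neural network training.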
Findings
Muon$^2$ reduces Newton--Schulz iterations by 40% in experiments.
Muon$^2$ outperforms Muon and variants in GPT and LLaMA pre-training.
Muon$^2$-F maintains gains with negligible memory overhead.
Abstract
Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton--Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates…
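The pipeline described in the abstract (momentum, then Adam-style second-moment preconditioning, then Newton--Schulz orthogonalization) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`newton_schulz`, `muon2_like_step`, `alignment`), the hyperparameter values, the omission of bias correction, and the exact form of the preconditioner are all assumptions made for illustration. The sketch also swaps in the classical cubic Newton--Schulz iteration rather than the tuned polynomial used in practical Muon implementations, and its `alignment` helper is one plausible reading of "directional alignment" (cosine similarity with the exact polar factor), not necessarily the paper's definition.

```python
import numpy as np


def newton_schulz(G, steps=5, eps=1e-7):
    """Classical cubic Newton-Schulz iteration toward the polar factor of G.

    Assumption: this textbook iteration stands in for Muon's tuned polynomial.
    """
    # Frobenius normalization keeps every singular value in (0, 1],
    # which is inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def muon2_like_step(grad, m, v, beta1=0.95, beta2=0.95, eps=1e-8, ns_steps=3):
    """One hypothetical Muon^2-style update direction.

    Assumed form: EMA momentum m, EMA of squared gradients v, elementwise
    division m / (sqrt(v) + eps), then Newton-Schulz orthogonalization.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad**2
    preconditioned = m / (np.sqrt(v) + eps)
    update = newton_schulz(preconditioned, steps=ns_steps)
    return update, m, v


def alignment(X, G):
    """Cosine similarity between X and the exact polar factor U V^T of G."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt
    return float(np.sum(X * P) / (np.linalg.norm(X) * np.linalg.norm(P)))
```

To explore the abstract's claim informally, one could compare `alignment(newton_schulz(M, steps=k), M)` for a raw momentum matrix `M` against the same quantity for its preconditioned counterpart at small `k`; how the two differ depends on the data and is not asserted here.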
