Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Ziyue Liu; Ruijie Zhang; Zhengyang Wang; Yequan Zhao; Yupeng Su; Zi Yang; Zheng Zhang

arXiv:2604.09967 · cs.LG · April 14, 2026

TL;DR

Muon$^2$ extends the Muon optimizer with Adam-style adaptive second-moment preconditioning applied before orthogonalization. In the reported experiments it needs 40% fewer Newton--Schulz iterations and outperforms Muon in GPT and LLaMA pre-training.

Contribution

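The paper introduces Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. The preconditioning improves the spectrum of the ill-conditioned momentum matrix, so fewer Newton--Schulz iterations are needed to reach a practically sufficient orthogonalization in large-scale neural network training.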

Findings

01

Muon$^2$ reduces the number of Newton--Schulz iterations by 40% in the reported experiments.

02

Muon$^2$ outperforms Muon and its variants in GPT and LLaMA pre-training.

03

The Muon$^2$-F variant retains these gains with negligible memory overhead.

Abstract

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton--Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, whose spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates…
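The idea in the abstract, preconditioning the momentum matrix with an Adam-style second-moment estimate and then orthogonalizing it with a few Newton--Schulz iterations, can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the authors' algorithm. The classical cubic Newton--Schulz iteration stands in for whatever iteration the paper uses. The element-wise form of the preconditioner, the missing bias correction, and all hyperparameter values are guesses. The function names `newton_schulz`, `polar_factor`, `alignment`, and `preconditioned_step` are hypothetical. In particular, `alignment` is just a Frobenius cosine similarity against the exact polar factor, which is one plausible reading of "directional alignment" and may differ from the paper's definition.

```python
import numpy as np


def newton_schulz(M, steps=5, eps=1e-7):
    """Approximate the polar factor U V^T of M with the classical cubic
    Newton-Schulz iteration (an assumption; the paper may use another variant)."""
    # Dividing by the Frobenius norm puts all singular values in (0, 1],
    # inside the convergence region of the cubic iteration.
    X = M / (np.linalg.norm(M) + eps)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X


def polar_factor(M):
    """Exact polar factor via SVD, used only as a reference."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt


def alignment(A, B, eps=1e-12):
    """Frobenius cosine similarity between two matrices (hypothetical metric)."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B) + eps))


def preconditioned_step(W, G, state, lr=0.02, beta1=0.95, beta2=0.999,
                        eps=1e-8, ns_steps=3):
    """One illustrative update: Adam-style preconditioning, then orthogonalization.

    All hyperparameter defaults are placeholders, not values from the paper.
    """
    # First-moment (momentum) and second-moment running averages, as in Adam.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * G
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * G * G
    # Element-wise preconditioning of the momentum (assumed form).
    P = state["m"] / (np.sqrt(state["v"]) + eps)
    # Orthogonalize the preconditioned matrix with a few NS iterations.
    return W - lr * newton_schulz(P, steps=ns_steps)


# Usage on random data: compare how well a few NS iterations align with the
# exact polar factor for the raw gradient versus the preconditioned matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
G = rng.standard_normal((4, 6))
state = {"m": np.zeros_like(W), "v": np.zeros_like(W)}
W_new = preconditioned_step(W, G, state)

P = state["m"] / (np.sqrt(state["v"]) + 1e-8)
print("raw alignment:           ", alignment(newton_schulz(G, 3), polar_factor(G)))
print("preconditioned alignment:", alignment(newton_schulz(P, 3), polar_factor(P)))
```

The two printed numbers let you check the abstract's claim on your own matrices: if preconditioning flattens the spectrum, the same number of iterations should land closer to the exact polar factor. A single random draw like this one is not evidence either way.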
