AMO: Adaptive Muon Orthogonalization
Xinlin Zhuang, Panyi Ouyang, Yichen Li, Jiangming Shi, Yizhang Chen, Shuman Liu, Ying Qian, Weiyang Liu, Haibo Zhang, Imran Razzak

TL;DR
AMO introduces a dynamic orthogonalization schedule for Muon that adapts to matrix geometry, improving large-scale pre-training performance over uniform approaches.
Contribution
The paper proposes Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then allocates the Newton-Schulz (NS) budget according to those measurements, outperforming uniform NS schedules.
Findings
AMO achieves a +0.76 performance gain on Llama3.1-1.4B.
AMO surpasses the baseline by +0.51 on Qwen3-1.7B.
Adaptive scheduling improves orthogonalization quality across training stages.
Abstract
Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for…
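To make the role of the NS budget concrete, here is a minimal sketch of orthogonalization by Newton-Schulz iteration. It uses the classical cubic update X <- 1.5 X - 0.5 X X^T X, which is an assumption made for illustration and not necessarily the polynomial or coefficients used by Muon or AMO. The helper names (`matmul`, `transpose`, `newton_schulz`, `orth_error`) and the example matrix are hypothetical, and the code is written in dependency-free Python for clarity rather than speed.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(row) for row in zip(*a)]

def newton_schulz(g, steps):
    """Approximately orthogonalize g with the classical cubic iteration.

    Assumes g has at least as many columns as rows and full row rank.
    """
    # Frobenius normalization keeps every singular value in (0, 1],
    # inside the convergence region of the cubic iteration.
    norm = math.sqrt(sum(v * v for row in g for v in row)) + 1e-12
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)]
             for r1, r2 in zip(x, xxt_x)]
    return x

def orth_error(x):
    """Frobenius norm of X X^T - I (zero when the rows are orthonormal)."""
    p = matmul(x, transpose(x))
    n = len(p)
    return math.sqrt(sum((p[i][j] - (1.0 if i == j else 0.0)) ** 2
                         for i in range(n) for j in range(n)))

# Hypothetical 2x3 "update" matrix; more NS steps give a smaller error.
g = [[3.0, 1.0, 0.0], [1.0, 2.0, 1.0]]
for steps in (1, 3, 5, 10):
    print(steps, orth_error(newton_schulz(g, steps)))
```

In this sketch, small singular values grow by roughly 1.5x per step until they approach 1, so a matrix with a wider singular-value spread needs more steps to reach the same error. This is one way to see why a single uniform step count can give uneven orthogonalization quality across matrices.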
