AMO: Adaptive Muon Orthogonalization

Xinlin Zhuang; Panyi Ouyang; Yichen Li; Jiangming Shi; Yizhang Chen; Shuman Liu; Ying Qian; Weiyang Liu; Haibo Zhang; Imran Razzak

arXiv:2605.17806 · cs.LG · May 19, 2026

TL;DR

AMO introduces a dynamic orthogonalization schedule for Muon that adapts to matrix geometry, improving large-scale pre-training performance over uniform approaches.

Contribution

The paper proposes Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then allocates the Newton-Schulz (NS) budget accordingly, outperforming uniform NS schedules.

Findings

01

AMO achieves a +0.76 performance gain on Llama3.1-1.4B.

02

AMO surpasses the baseline by +0.51 on Qwen3-1.7B.

03

Adaptive scheduling improves orthogonalization quality across training stages.

Abstract

Muon has recently emerged as a competitive alternative to AdamW for large-scale pre-training, with orthogonalization via Newton-Schulz (NS) iterations as its core operation. Existing Muon variants apply a uniform NS schedule to all parameter matrices, overlooking possible differences in orthogonalization difficulty and its impact on performance. Through a systematic empirical study, we show that this per-matrix heterogeneity is pervasive and largely determined by matrix geometry, which evolves dynamically across operator types, training stages, and network depths. As a result, uniform NS schedules can lead to uneven orthogonalization quality across the model. Motivated by these findings, we propose Adaptive Muon Orthogonalization (AMO), an observe-then-commit method that measures weight geometry by operator type early in training and then uses these signals to allocate the NS budget for…
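To make "orthogonalization difficulty" concrete, here is a minimal sketch of Newton-Schulz orthogonalization under a fixed step budget. It is not the paper's method: it uses the classic cubic iteration X ← 1.5X − 0.5XXᵀX rather than whatever polynomial Muon or AMO actually uses, and the helper names (`newton_schulz`, `orth_error`, `matrix_with_singular_values`) and the two test spectra are assumptions chosen for illustration only.

```python
import numpy as np


def newton_schulz(G, steps):
    """Approximate the orthogonal polar factor of G.

    Uses the classic cubic Newton-Schulz iteration (an illustrative
    stand-in, not the tuned polynomial used in Muon implementations).
    """
    # Frobenius normalization puts every singular value in (0, 1],
    # which lies inside the iteration's convergence region.
    X = G / (np.linalg.norm(G) + 1e-12)
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X


def orth_error(X):
    """Frobenius distance between X X^T and the identity."""
    return np.linalg.norm(X @ X.T - np.eye(X.shape[0]))


def matrix_with_singular_values(s, rng):
    """Build a square matrix with prescribed singular values s (hypothetical helper)."""
    n = len(s)
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return U @ np.diag(s) @ V.T


rng = np.random.default_rng(0)
# Two assumed "geometries": a flat spectrum and a fast-decaying one.
easy = matrix_with_singular_values(np.linspace(1.0, 0.5, 8), rng)
hard = matrix_with_singular_values(np.logspace(0, -3, 8), rng)

for steps in (5, 10, 20):
    e_easy = orth_error(newton_schulz(easy, steps))
    e_hard = orth_error(newton_schulz(hard, steps))
    print(f"steps={steps:2d}  easy={e_easy:.2e}  hard={e_hard:.2e}")
```

With the same step count, the matrix with the fast-decaying spectrum ends up further from orthogonal than the well-conditioned one, because its smallest singular values need many more iterations to approach 1. That is the kind of per-matrix unevenness a uniform NS schedule would leave in place.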
