MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

Ruijie Zhang; Yequan Zhao; Ziyue Liu; Zhengyang Wang; Yupeng Su; Liyan Tan; Zheng Zhang

arXiv:2602.21545 · cs.LG · May 15, 2026

TL;DR

Muon+ adds a single normalization step after Muon's polar orthogonalization. The step corrects the column- and row-wise norm imbalance that practical polar iterations leave in the update, and it improves training and validation perplexity in large language model pre-training.

Contribution

The paper identifies the post-polar imbalanced update problem in Muon, shows that the imbalance weakens Muon's per-step descent guarantee, and proposes Muon+, a one-line fix that adds no optimizer state.

Findings

01 Muon+ reaches lower training and validation perplexity than Muon.

02 Muon+ achieves a significant pre-training speedup over Muon across the GPT and LLaMA models evaluated.

03 The added normalization step mitigates the column- and row-wise norm imbalance in Muon's update.

Abstract

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton-Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both…
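
To make the idea in the abstract concrete, here is a minimal sketch of a polar step followed by one normalization step. It is not the authors' implementation, and several things in it are assumptions. The abstract does not say which normalization Muon+ uses, so the sketch assumes column-wise L2 normalization. The quintic Newton-Schulz coefficients are the ones from the widely used open-source Muon implementation. The helper names (`newton_schulz`, `normalize_columns`, `muon_plus_direction`) are hypothetical. It uses dependency-free Python on a tiny matrix so it runs anywhere; a real optimizer would use tensor operations.

```python
import math
import random

# Quintic Newton-Schulz coefficients from the common open-source Muon code
# (an assumption here; the abstract does not list them).
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximate the polar factor of g with a few Newton-Schulz iterations."""
    a, b, c = NS_COEFFS
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / (fro + eps) for v in row] for row in g]
    transposed = len(x) > len(x[0])  # iterate on the wide orientation
    if transposed:
        x = transpose(x)
    for _ in range(steps):
        gram = matmul(x, transpose(x))
        gram2 = matmul(gram, gram)
        poly = [[b * p + c * q for p, q in zip(r1, r2)]
                for r1, r2 in zip(gram, gram2)]
        px = matmul(poly, x)
        x = [[a * u + v for u, v in zip(r1, r2)] for r1, r2 in zip(x, px)]
    return transpose(x) if transposed else x


def column_norms(m):
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*m)]


def normalize_columns(m, eps=1e-7):
    """Assumed normalization: rescale every column to unit L2 norm."""
    norms = column_norms(m)
    return [[v / (n + eps) for v, n in zip(row, norms)] for row in m]


def muon_plus_direction(momentum, steps=5):
    """Polar step followed by the single extra normalization step."""
    return normalize_columns(newton_schulz(momentum, steps))


random.seed(0)
momentum = [[random.gauss(0.0, 1.0) for _ in range(6)] for _ in range(4)]

polar = newton_schulz(momentum)
update = muon_plus_direction(momentum)

# Ratio of largest to smallest column norm: 1.0 means perfectly balanced.
print("imbalance after polar step:   ", max(column_norms(polar)) / min(column_norms(polar)))
print("imbalance after normalization:", max(column_norms(update)) / min(column_norms(update)))
```

The second printed ratio should be essentially 1, because the assumed normalization equalizes the column norms by construction. The first ratio shows how unevenly the polar step alone spreads the update across columns for this random input. The normalization only rescales the polar output, so the sketch needs no extra optimizer state, consistent with the abstract's claim.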
