Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum
Jingru Li, Yibo Fan, Huan Li

TL;DR
This paper introduces two variance-adaptive variants of the Muon optimizer, Muon-NSR and Muon-VS, which improve LLM pretraining efficiency by converging faster and requiring fewer training iterations than AdamW and baseline Muon.
Contribution
The paper proposes two novel variance-adaptive momentum update methods, Muon-NSR and Muon-VS, enhancing optimizer efficiency for large language model pretraining.
Findings
Muon-NSR and Muon-VS accelerate convergence on GPT-2 and LLaMA.
They achieve lower validation loss than AdamW and Muon baselines.
On LLaMA-1.2B, they reduce the number of training iterations by a factor of 1.36.
Abstract
Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a variance-adaptive sign update algorithm, we propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum before orthogonalization. Muon-NSR applies noise-to-signal ratio (NSR) modulation, while Muon-VS performs variance-based scaling without introducing additional hyperparameters. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than…
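The pipeline the abstract describes (variance-adaptive normalization of the momentum, then orthogonalization) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions. The function names `variance_adaptive` and `orthogonalize` are hypothetical. The normalization m / sqrt(m^2 + sigma^2), which equals sign(m) / sqrt(1 + NSR) with NSR = sigma^2 / m^2, is the Adam-style variance-adaptive sign form and is only assumed to resemble the idea behind Muon-NSR and Muon-VS, whose exact update rules are not given in the text above. The classical cubic Newton-Schulz iteration stands in for whatever orthogonalization routine the paper uses. Plain Python lists are used so the sketch has no dependencies.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def variance_adaptive(momentum, variance, eps=1e-12):
    """Element-wise m / sqrt(m^2 + var), i.e. sign(m) / sqrt(1 + NSR).

    ASSUMPTION: an Adam-style illustration, not the paper's exact formula.
    """
    return [
        [m / math.sqrt(m * m + v + eps) for m, v in zip(m_row, v_row)]
        for m_row, v_row in zip(momentum, variance)
    ]


def orthogonalize(matrix, steps=30):
    """Approximate the orthogonal polar factor of a full-rank matrix.

    Uses the classical cubic Newton-Schulz iteration
    X <- 1.5 * X - 0.5 * X @ X.T @ X, after scaling by the Frobenius norm
    so that every singular value starts in (0, 1].
    ASSUMPTION: a stand-in for the orthogonalization step used in practice.
    """
    norm = math.sqrt(sum(x * x for row in matrix for x in row)) or 1.0
    x = [[v / norm for v in row] for row in matrix]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * a - 0.5 * b for a, b in zip(x_row, c_row)]
            for x_row, c_row in zip(x, xxtx)
        ]
    return x


# Toy momentum matrix and a made-up per-entry variance estimate.
momentum = [[3.0, -1.0], [0.5, 2.0]]
variance = [[0.0, 3.0], [0.75, 0.0]]

# Noisy entries (large variance relative to m^2) are shrunk toward zero;
# noise-free entries become +1 or -1, like the sign operator.
scaled = variance_adaptive(momentum, variance)  # about [[1, -0.5], [0.5, 1]]

# Orthogonalizing pushes all singular values of the scaled momentum to 1.
update = orthogonalize(scaled)
```

In this toy example the entries with zero variance map to +1, matching the plain sign update, while the entries whose variance is three times the squared momentum are scaled by 1/2. The orthogonalized `update` satisfies `update @ update.T ≈ I`, which is the sense in which orthogonalization acts as a matrix analogue of the element-wise sign operator.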
Taxonomy
Topics: Computational Physics and Python Applications · Machine Learning and Data Classification · Speech Recognition and Synthesis
