Variance-Adaptive Muon: Accelerating LLM Pretraining with NSR-Modulated and Variance-Scaled Momentum

Jingru Li; Yibo Fan; Huan Li

arXiv:2601.14603·cs.LG·January 22, 2026

TL;DR

This paper introduces two variance-adaptive variants of the Muon optimizer, Muon-NSR and Muon-VS, which make LLM pretraining more efficient: they converge faster and need fewer training iterations than AdamW and baseline Muon.

Contribution

The paper proposes two variance-adaptive momentum update methods, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to the momentum before orthogonalization to improve optimizer efficiency in large language model pretraining.

Findings

01  Muon-NSR and Muon-VS accelerate convergence on GPT-2 and LLaMA.

02  They achieve lower validation loss than AdamW and Muon baselines.

03  On LLaMA-1.2B, they reduce training iterations by 1.36x.

Abstract

Large Language Models (LLMs) achieve competitive performance across diverse natural language processing (NLP) tasks, yet pretraining is computationally demanding, making optimizer efficiency an important practical consideration. Muon accelerates LLM pretraining via orthogonal momentum updates that serve as a matrix analogue of the element-wise sign operator. Motivated by the recent perspective that Adam is a variance-adaptive sign update algorithm, we propose two variants of Muon, Muon-NSR and Muon-VS, which apply variance-adaptive normalization to momentum before orthogonalization. Muon-NSR applies noise-to-signal ratio (NSR) modulation, while Muon-VS performs variance-based scaling without introducing additional hyperparameters. Experiments on GPT-2 and LLaMA pretraining demonstrate that our proposed methods accelerate convergence and consistently achieve lower validation loss than…
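
The pipeline described in the abstract (normalize the momentum using a variance-aware statistic, then orthogonalize) can be illustrated with a minimal NumPy sketch. This is not the paper's Muon-NSR or Muon-VS update, whose formulas are not given on this page. As labeled assumptions, it uses an Adam-style element-wise normalization m / sqrt(v) as a stand-in for the variance-adaptive step, and an exact SVD in place of the iterative orthogonalization that Muon implementations typically use. The function names and the hyperparameters beta1, beta2, and eps are hypothetical.

```python
import numpy as np


def orthogonalize(matrix):
    """Return U @ Vt, i.e. `matrix` with its singular values replaced by ones."""
    u, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return u @ vt


def normalized_orthogonal_step(grad, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative step: momentum -> element-wise normalization -> orthogonalization.

    ASSUMPTION: the normalization is the Adam-style m / sqrt(v), chosen only to
    illustrate the idea; it is not the Muon-NSR or Muon-VS formula.
    """
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate
    normalized = m / (np.sqrt(v) + eps)         # shrinks entries whose gradient is noisy
    return orthogonalize(normalized), m, v


# Toy run on a 4 x 6 "weight matrix" with noisy gradients around a fixed signal.
rng = np.random.default_rng(0)
shape = (4, 6)
signal = rng.standard_normal(shape)
m = np.zeros(shape)
v = np.zeros(shape)
for _ in range(5):
    grad = signal + 0.5 * rng.standard_normal(shape)
    update, m, v = normalized_orthogonal_step(grad, m, v)

# On a diagonal matrix, orthogonalization reduces to the element-wise sign.
sign_like = orthogonalize(np.diag([2.0, -3.0]))
```

In this sketch `update` has orthonormal rows, and `sign_like` equals diag(1, -1), which shows concretely why orthogonalization is described as a matrix analogue of the sign operator.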

Taxonomy

Topics: Computational Physics and Python Applications · Machine Learning and Data Classification · Speech Recognition and Synthesis