A Note on the Convergence of Muon

Jiaxiang Li; Mingyi Hong

arXiv:2502.02900 · math.OC · June 3, 2025

Open Access

TL;DR

This note analyzes the convergence of Muon, a new optimizer for pretraining large language models, and relates it to steepest descent under the spectral norm.

Contribution

The note does not introduce Muon; it provides a convergence analysis of both versions of this existing optimizer and discusses the implications of that analysis.

Findings

01

A convergence analysis is provided for both versions of the Muon optimizer.

02

Muon is closely related to a steepest descent method whose update direction minimizes the quadratic approximation of the objective under the spectral norm.

03

The implications of the convergence analysis are discussed.

Abstract

In this note, we inspect the convergence of a new optimizer for pretraining LLMs, namely the Muon optimizer. Such an optimizer is closely related to a specialized steepest descent method where the update direction is the minimizer of the quadratic approximation of the objective function under spectral norm. We provide the convergence analysis on both versions of the optimizer and discuss its implications.
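The steepest descent connection in the abstract can be illustrated with a small sketch. For a gradient matrix G with singular value decomposition G = U S V^T, the direction of unit spectral norm that decreases the linear term most is -U V^T, so the minimizer of the quadratic approximation under the spectral norm is a negative multiple of U V^T. The code below is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the orthogonal factor is approximated with the classical cubic Newton-Schulz iteration (which may differ from what Muon implementations use), momentum is omitted, and the step applies a plain learning rate to U V^T.

```python
import math

def matmul(A, B):
    # Plain list-of-lists matrix product.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inner(A, B):
    # Frobenius inner product <A, B> = trace(A^T B).
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def orthogonalize(G, steps=30):
    # Approximates U V^T for G = U S V^T (assumes G is nonzero).
    # Dividing by the Frobenius norm puts every singular value in (0, 1],
    # where the cubic Newton-Schulz map s -> 1.5*s - 0.5*s**3 pushes
    # each nonzero singular value toward 1.
    fro = math.sqrt(inner(G, G))
    X = [[x / fro for x in row] for row in G]
    for _ in range(steps):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)]
             for rx, ry in zip(X, XXtX)]
    return X

def spectral_steepest_step(W, G, lr):
    # Hypothetical momentum-free step: W <- W - lr * U V^T.
    O = orthogonalize(G)
    return [[w - lr * o for w, o in zip(rw, ro)] for rw, ro in zip(W, O)]

G = [[2.0, 1.0], [0.0, 1.0]]
O = orthogonalize(G)
# <G, U V^T> equals the nuclear norm of G (the sum of its singular values).
print(inner(G, O))
```

For this 2 x 2 example the singular values of G sum to sqrt(10), so the printed inner product should agree with `math.sqrt(10)` up to the accuracy of the iteration.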
