Practical Efficiency of Muon for Pretraining

Essential AI: Ishaan Shah; Anthony M. Polloreno; Karl Stratos; Philip Monk; Adarsh Chaluvaraju; Andrew Hojel; Andrew Ma; Anil Thomas; Ashish Tanwer; Darsh J Shah; Khoi Nguyen; Kurt Smith; Michael Callahan; Michael Pust; Mohit Parmar; Peter Rushton; Platon Mazarakis; Ritvik Kapila; Saurabh Srivastava; Somanshu Singla; Tim Romanski; Yash Vanjani; Ashish Vaswani

arXiv:2505.02222 · cs.LG · May 21, 2025

TL;DR

This paper shows that Muon, a second-order optimizer, retains data efficiency better than AdamW at large batch sizes while remaining computationally efficient, enabling more economical training of large models.

Contribution

It shows that Muon expands the Pareto frontier over AdamW on the compute-time tradeoff, and it combines Muon with the maximal update parameterization (muP), using a simple telescoping algorithm, for efficient hyperparameter transfer.

Findings

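01. Muon retains data efficiency better than AdamW at large batch sizes, far beyond the critical batch size.

02. The combination of Muon and muP enables efficient hyperparameter transfer.

03. The findings are validated on models of up to four billion parameters, with ablations on the data distribution and architecture.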

Abstract

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
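As an illustration of the kind of update the abstract refers to, the following is a minimal sketch of a Muon-style step: accumulate momentum, approximately orthogonalize the momentum matrix with a Newton-Schulz iteration, and apply the result as the weight update. It is not the authors' implementation. The function names (`matmul`, `transpose`, `orthogonalize`, `muon_step`), the default `lr` and `mu` values, and the use of plain heavy-ball momentum are assumptions made for illustration, and the textbook cubic iteration X <- 1.5 X - 0.5 X X^T X is swapped in here for the tuned polynomial that reference implementations use.

```python
# Illustrative sketch only: pure-Python matrices (lists of rows), hypothetical names.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=20):
    """Approximate the orthogonal polar factor of g with a cubic Newton-Schulz iteration.

    Dividing by the Frobenius norm puts every singular value in (0, 1],
    which is inside the convergence region of the cubic iteration.
    """
    norm = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / (norm + 1e-12) for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)  # X X^T X
        x = [[1.5 * a - 0.5 * b for a, b in zip(r1, r2)] for r1, r2 in zip(x, xxt_x)]
    return x


def muon_step(w, grad, momentum, lr=0.02, mu=0.95):
    """One Muon-style step for a single 2D weight matrix (assumed hyperparameters)."""
    # Heavy-ball momentum accumulation (an assumption; variants exist).
    momentum = [[mu * m + g for m, g in zip(rm, rg)] for rm, rg in zip(momentum, grad)]
    # Replace the raw momentum with its approximately orthogonalized version.
    update = orthogonalize(momentum)
    w = [[p - lr * u for p, u in zip(rw, ru)] for rw, ru in zip(w, update)]
    return w, momentum
```

For a wide matrix such as a 2x3 gradient, the rows of `orthogonalize(grad)` come out approximately orthonormal, so every direction in the update has roughly the same scale regardless of the singular values of the raw gradient.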

Code & Models

Repositories

KellerJordan/Muon (PyTorch)

Taxonomy

Topics: Computational Physics and Python Applications · Machine Learning and Data Classification · Stochastic Gradient Optimization Techniques

Methods: AdamW