To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters

Sara Dragutinović; Rajesh Ranganath

arXiv:2603.00742·cs.LG·March 3, 2026

Open Access

TL;DR

This paper analyzes the biases introduced by the Muon optimizer in deep learning, highlighting potential downsides such as reduced simplicity bias and increased risk of fitting spurious features, despite its training speed advantages.

Contribution

The paper provides a theoretical analysis of Muon's biases, shows how they affect learning trajectories and the solutions a model reaches, and argues that such biases should be considered when developing optimizers.

Findings

1. Muon removes simplicity bias present in SGD
2. Muon may struggle to learn common structures across tasks
3. Muon is more prone to fitting spurious features

Abstract

For a long time, Adam has served as the default choice for training deep neural networks. Recently, many new optimizers have been introduced, of which Muon has perhaps become the most popular due to its superior training speed. While many papers set out to validate the benefits of Muon, our paper investigates the potential downsides stemming from the mechanism driving this speedup. We explore the biases induced when optimizing with Muon, providing theoretical analysis and examining its consequences for the learning trajectories and the solutions learned. While the theory does provide justification for the benefits Muon brings, it also guides our intuition in constructing a couple of examples where Muon-optimized models are at a disadvantage. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more…

Taxonomy

Topics: Muon and positron interactions and applications · Computational Physics and Python Applications · Machine Learning and Data Classification