To Use or not to Use Muon: How Simplicity Bias in Optimizers Matters
Sara Dragutinović, Rajesh Ranganath

TL;DR
This paper analyzes the biases introduced by the Muon optimizer in deep learning, highlighting potential downsides such as reduced simplicity bias and increased risk of fitting spurious features, despite its training speed advantages.
Contribution
It provides a theoretical analysis of Muon's biases, illustrating how they affect learning trajectories and model solutions, and emphasizes the importance of considering biases in optimizer development.
Findings
Muon removes simplicity bias present in SGD
Muon may struggle to learn common structures across tasks
Muon is more prone to fitting spurious features
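The first finding can be pictured with a small numerical sketch. This is an illustration under stated assumptions, not the authors' experiments. Muon is commonly described as replacing the momentum-averaged gradient matrix with an approximately orthogonalized version of it, usually computed with a Newton-Schulz iteration. The sketch below uses an exact SVD as a stand-in for that iteration and omits momentum. The matrix `G` and the helper `orthogonalize` are hypothetical names introduced only for this example.

```python
import numpy as np

def orthogonalize(G):
    """Return U @ Vt, the orthogonal polar factor of G.

    Assumption: exact SVD stands in for the Newton-Schulz
    approximation that Muon implementations typically use.
    """
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

rng = np.random.default_rng(0)

# Hypothetical gradient matrix with strongly unequal singular values,
# i.e. a few "dominant" directions and several weak ones.
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
G = U @ np.diag([10.0, 1.0, 0.1, 0.01]) @ V.T

# A plain gradient step moves mostly along the dominant directions.
sgd_step = G
# An orthogonalized step moves equally along every direction.
muon_like_step = orthogonalize(G)

sgd_sv = np.linalg.svd(sgd_step, compute_uv=False)
muon_sv = np.linalg.svd(muon_like_step, compute_uv=False)

print("singular values of plain gradient step:", sgd_sv)
print("singular values of orthogonalized step:", muon_sv)
```

In the plain gradient step the update is dominated by the largest singular direction, so dominant structure is learned first. After orthogonalization every direction receives an update of the same magnitude, which is one way to read the claim that the ordering imposed by SGD is lost.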
Abstract
For a long period of time, Adam has served as the ubiquitous default choice for training deep neural networks. Recently, many new optimizers have been introduced, out of which Muon has perhaps gained the highest popularity due to its superior training speed. While many papers set out to validate the benefits of Muon, our paper investigates the potential downsides stemming from the mechanism driving this speedup. We explore the biases induced when optimizing with Muon, providing theoretical analysis and its consequences to the learning trajectories and solutions learned. While the theory does provide justification for the benefits Muon brings, it also guides our intuition when coming up with a couple of examples where Muon-optimized models have disadvantages. The core problem we emphasize is that Muon optimization removes a simplicity bias that is naturally preserved by older, more…
Taxonomy
Topics: Muon and positron interactions and applications · Computational Physics and Python Applications · Machine Learning and Data Classification
