The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks
Eitan Gronich, Gal Vardi

TL;DR
This paper studies the implicit bias of momentum-based optimizers such as Muon, Adam, and Signum on smooth homogeneous neural networks. It shows that each optimizer is biased toward maximizing the margin with respect to its own norm, extending earlier results on steepest descent.
Contribution
It extends implicit-bias analysis from steepest descent to normalized steepest descent, momentum optimizers, and Adam, showing that each is biased toward KKT points of a margin maximization problem whose norm depends on the optimizer.
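As a rough guide to what a margin defined by a norm means, the margin maximization problem for a homogeneous binary classifier is commonly written as below. The notation ($f$, $\theta$, $(x_i, y_i)$, $n$, $L$, $\|\cdot\|$) is assumed here for illustration and is not taken from the paper.

```latex
% Assumed notation: f(\theta; x) is homogeneous of degree L,
% i.e. f(c\theta; x) = c^L f(\theta; x) for all c > 0,
% and \|\cdot\| is the norm associated with the optimizer.
\min_{\theta} \; \tfrac{1}{2}\,\|\theta\|^2
\quad \text{subject to} \quad
y_i \, f(\theta; x_i) \ge 1 \quad \text{for all } i = 1, \dots, n.
```

Swapping the norm changes which parameter vectors count as max-margin solutions, which is why the choice of optimizer can matter even when the training loss goes to zero in every case.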
Findings
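On smooth homogeneous models, momentum optimizers follow approximate steepest descent trajectories under a decaying learning rate schedule.
Each optimizer is biased toward the margin in its own norm: spectral for Muon, $\ell_2$ for MomentumGD, $\ell_\infty$ for Signum and Adam, and a hybrid norm for Muon-Signum and Muon-Adam.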
Experimental results support the theoretical analysis.
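The link between an optimizer and a norm can be sketched in a few lines of NumPy. The snippet below is a minimal illustration, not the paper's implementation: the function names `steepest_direction` and `momentum_step` are hypothetical, the momentum form (an exponential moving average of gradients) is one common convention, and the spectral case uses an exact SVD where practical Muon implementations typically use an iterative approximation.

```python
import numpy as np


def steepest_direction(m, norm):
    """Return a direction d with unit norm that maximizes <m, d>.

    `norm` selects the geometry: "l2", "linf", or "spectral".
    These names are illustrative assumptions.
    """
    if norm == "l2":
        # Normalized gradient direction (Frobenius norm for matrices).
        n = np.linalg.norm(m)
        return m / n if n > 0 else m
    if norm == "linf":
        # Sign descent: every coordinate moves by the same magnitude.
        return np.sign(m)
    if norm == "spectral":
        # Orthogonalized update U V^T: all singular values set to 1.
        u, _, vt = np.linalg.svd(m, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")


def momentum_step(w, m, grad, lr, beta, norm):
    """One momentum steepest descent step (hypothetical helper).

    The momentum buffer is updated first, then the step is taken along
    the steepest direction of the buffer rather than of the raw gradient.
    """
    m = beta * m + (1.0 - beta) * grad
    w = w - lr * steepest_direction(m, norm)
    return w, m
```

With `norm="l2"` this resembles normalized MomentumGD, with `norm="linf"` it resembles Signum, and with `norm="spectral"` it resembles Muon on a single weight matrix. A decaying `lr` would be supplied by the caller, for example `lr = lr0 / np.sqrt(t)`, which is again only an assumed schedule.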
Abstract
We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_2$ norm), and Signum ($\ell_\infty$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_\infty$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the…
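One way to build intuition for why Adam without the stability constant behaves like a sign-based method is to write out a single update. The sketch below assumes the standard bias-corrected form of Adam with the epsilon term removed from the denominator; the function name `adam_step_no_eps` is hypothetical, and the paper may analyze a different variant. It also assumes every gradient coordinate is nonzero, since a zero coordinate would cause a division by zero.

```python
import numpy as np


def adam_step_no_eps(w, m, v, grad, t, lr, beta1=0.9, beta2=0.999):
    """One Adam update with no stability constant (hypothetical helper).

    t is the 1-based step counter used for bias correction.
    Assumes all coordinates of the second-moment estimate are nonzero.
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                # bias-corrected second moment
    w = w - lr * m_hat / np.sqrt(v_hat)           # no epsilon in the denominator
    return w, m, v
```

Starting from zero buffers, the first step reduces to `w - lr * np.sign(grad)`, because the bias-corrected moments equal `grad` and `grad ** 2`. Every coordinate therefore moves by the same magnitude, which matches the $\ell_\infty$ geometry of sign descent. This is only an intuition for the first step, not a substitute for the paper's analysis of the full trajectory.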
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Computational Physics and Python Applications · Machine Learning in Materials Science
