The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

Eitan Gronich; Gal Vardi

arXiv:2602.16340·cs.LG·March 4, 2026

TL;DR

This paper investigates the implicit bias of momentum-based optimizers such as Muon, Adam, and Signum on smooth homogeneous neural networks. It shows that each optimizer is biased towards KKT points of a margin maximization problem whose norm depends on the optimizer, extending earlier results on steepest descent.

Contribution

It extends the analysis of implicit bias from steepest descent to momentum-based steepest descent algorithms and to Adam, showing that on smooth homogeneous models each one is biased towards maximizing the margin with respect to a different norm.
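For orientation, the margin maximization problem referred to here is usually written as below. This is the standard formulation for homogeneous models, not a quotation from the paper: the symbols $f$, $\theta$, $(x_i, y_i)$, $n$, and $L$ are assumed notation, and the paper may use an equivalent variant (for example a squared norm).

```latex
% Assumed notation: f(\theta; x) is the model output, (x_i, y_i) with y_i \in \{-1, +1\}
% are the n training examples, and \|\cdot\| is the norm tied to the optimizer.
% Homogeneity of degree L:  f(c\,\theta; x) = c^{L} f(\theta; x)  for all c > 0.
\begin{aligned}
  \min_{\theta} \quad & \|\theta\| \\
  \text{s.t.} \quad   & y_i \, f(\theta; x_i) \ge 1, \qquad i = 1, \dots, n .
\end{aligned}
% Norm by optimizer, as stated in the abstract:
%   MomentumGD: \ell_2    Signum, Adam: \ell_\infty    Muon: spectral norm
```

A KKT point of this problem satisfies its first-order optimality conditions. The abstract claims a bias towards such points, which is weaker than convergence to a global margin maximizer.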

Findings

01

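On smooth homogeneous models with a decaying learning rate schedule, momentum steepest descent algorithms (Muon, MomentumGD, Signum) follow approximate steepest descent trajectories.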

02

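The margin maximized depends on the optimizer: spectral norm for Muon, $\ell_{2}$ for MomentumGD, $\ell_{\infty}$ for Signum and Adam, and a hybrid norm for Muon-Signum and Muon-Adam.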

03

Experimental results support the theoretical analysis.

Abstract

We study the implicit bias of momentum-based optimizers on homogeneous models. We first extend existing results on the implicit bias of steepest descent in homogeneous models to normalized steepest descent with an optional learning rate schedule. We then show that for smooth homogeneous models, momentum steepest descent algorithms like Muon (spectral norm), MomentumGD ($\ell_{2}$ norm), and Signum ($\ell_{\infty}$ norm) are approximate steepest descent trajectories under a decaying learning rate schedule, proving that these algorithms too have a bias towards KKT points of the corresponding margin maximization problem. We extend the analysis to Adam (without the stability constant), which maximizes the $\ell_{\infty}$ margin, and to Muon-Signum and Muon-Adam, which maximize a hybrid norm. Our experiments corroborate the theory and show that the identity of the margin maximized depends on the…
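To make the three momentum steepest descent updates named in the abstract concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the exponential-moving-average form of the momentum, and the `lr0 / sqrt(t + 1)` schedule are all hypothetical choices, and the spectral direction is computed with an exact SVD, whereas practical Muon implementations typically approximate it with Newton-Schulz iterations.

```python
import numpy as np


def steepest_direction(m, norm):
    """Unit-norm steepest descent direction for momentum buffer m.

    'l2'       -> m / ||m||_2              (normalized MomentumGD)
    'linf'     -> sign(m)                  (Signum)
    'spectral' -> U @ V^T from the SVD of m (Muon-style orthogonalization)
    """
    if norm == "l2":
        n = np.linalg.norm(m)
        return m / n if n > 0 else np.zeros_like(m)
    if norm == "linf":
        return np.sign(m)
    if norm == "spectral":
        # Assumes m is a 2-D matrix; for a full-rank m this is its polar factor.
        u, _, vt = np.linalg.svd(m, full_matrices=False)
        return u @ vt
    raise ValueError(f"unknown norm: {norm}")


def decaying_lr(t, lr0=0.1):
    # Hypothetical schedule; the abstract only says the schedule is decaying.
    return lr0 / np.sqrt(t + 1)


def momentum_step(w, m, grad, lr, beta, norm):
    """One step: update the momentum buffer, then move along the
    steepest descent direction of the chosen norm."""
    m = beta * m + (1.0 - beta) * grad  # assumed EMA form of momentum
    w = w - lr * steepest_direction(m, norm)
    return w, m
```

A single Signum-style step would then look like `w, m = momentum_step(w, m, grad, decaying_lr(t), beta=0.9, norm="linf")`. Only the direction map changes between the three optimizers, which matches the abstract's framing: the choice of norm determines which margin is maximized.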

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Computational Physics and Python Applications · Machine Learning in Materials Science