How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data
Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, Christos Thrampoulidis

TL;DR
This paper investigates how spectral gradient descent, exemplified by Muon, improves generalization on imbalanced data by learning all principal components of the data at equal rates, outperforming traditional methods like GD and Adam.
Contribution
It introduces a simplified spectral optimization framework and provides theoretical and empirical evidence of its advantages over Euclidean gradient descent on imbalanced datasets.
Findings
Spectral gradient descent learns all principal components at equal rates.
Spectral methods outperform Euclidean GD early in training on imbalanced data.
Depth amplifies the benefits of spectral optimization.
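The first two findings can be illustrated with a minimal sketch. It uses a hypothetical toy problem, not the paper's Gaussian mixture setup: a linear map `W` fit to a target `W_star` under a diagonal data spectrum `s` with strongly imbalanced entries. The names `s`, `W_star`, and `lr`, and the quadratic loss itself, are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy problem (an assumption, not the paper's data model):
# loss(W) = 0.5 * ||(W - W_star) diag(s)^(1/2)||_F^2
s = np.array([10.0, 1.0, 0.1])   # imbalanced spectrum: one dominant component
W_star = np.eye(3)               # target linear map
W = np.zeros((3, 3))             # zero initialization
lr = 0.05

# Gradient of the quadratic loss above.
grad = (W - W_star) @ np.diag(s)

# Euclidean GD: step along the raw gradient.
gd_W = W - lr * grad

# Spectral GD: step along U V^T, which discards the singular values.
U, _, Vt = np.linalg.svd(grad)
spec_W = W - lr * (U @ Vt)

# Progress toward the target along each component after one step.
gd_progress = np.diag(gd_W)      # equals lr * s: scales with the spectrum
spec_progress = np.diag(spec_W)  # equals lr for every component

print("GD progress per component:    ", gd_progress)
print("SpecGD progress per component:", spec_progress)
```

In this toy, one GD step moves component i by `lr * s[i]`, so the weakest component advances 100 times more slowly than the dominant one, while the spectral step moves every component by the same amount `lr`.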
Abstract
The growing adoption of spectrum-aware matrix-valued optimizers such as Muon and Shampoo in deep learning motivates a systematic study of their generalization properties and, in particular, when they might outperform competitive algorithms. We approach this question by introducing appropriate simplifying abstractions as follows: First, we use imbalanced data as a testbed. Second, we study the canonical form of such optimizers, which is Spectral Gradient Descent (SpecGD) -- each update step is $UV^\top$ where $U \Sigma V^\top$ is the truncated SVD of the gradient. Third, within this framework we identify a canonical setting for which we precisely quantify when SpecGD outperforms vanilla Euclidean GD. For a Gaussian mixture data model and both linear and bilinear models, we show that unlike GD, which prioritizes learning dominant principal components of the data first, SpecGD learns all…
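A minimal NumPy sketch of a single SpecGD step follows, assuming the update direction is the orthogonal factor U V^T of the gradient's truncated SVD. The function names `specgd_update` and `specgd_step`, the `rank` parameter, and the tolerance used to pick the truncation rank are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def specgd_update(grad, rank=None):
    """Return U V^T built from the truncated SVD of `grad`.

    `rank` is a hypothetical parameter: when omitted, directions whose
    singular values are numerically zero are dropped (an assumed rule).
    """
    grad = np.asarray(grad, dtype=float)
    U, S, Vt = np.linalg.svd(grad, full_matrices=False)
    if rank is None:
        tol = S.max() * max(grad.shape) * np.finfo(float).eps
        rank = int(np.sum(S > tol))
    # Keep the top-`rank` singular directions, discard the singular values.
    return U[:, :rank] @ Vt[:rank, :]

def specgd_step(W, grad, lr, rank=None):
    """One SpecGD step: W <- W - lr * U V^T."""
    return W - lr * specgd_update(grad, rank)

# Example: a gradient whose columns differ in scale by two orders of magnitude.
rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3)) * np.array([10.0, 1.0, 0.1])
D = specgd_update(G)

# The update direction has all singular values equal to 1, so no
# direction of the gradient dominates the step.
sv = np.linalg.svd(D, compute_uv=False)
print(sv)
```

A full SVD per step is used here only for clarity; practical optimizers in this family typically approximate the same orthogonal factor by cheaper means.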
