Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws
Nandan Kumar Jha, Brandon Reagen

TL;DR
This paper shows that the choice of optimizer changes the spectral scaling laws of Transformer models: with architecture and loss held fixed, the optimizer determines how effectively added FFN width is converted into utilized capacity.
Contribution
It reveals that optimizer choice fundamentally alters spectral representation scaling, highlighting the importance of optimizer-architecture co-design in model development.
Findings
AdamW shows weak hard-rank scaling (exponent 0.44) on rare-token (TAIL) representations.
Muon achieves linear hard-rank scaling (exponent 1.02) in the same regime.
Optimizer effects on spectral geometry often surpass architectural influences.
Abstract
Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling (exponent 0.44) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling (exponent 1.02) in the same regime, a more than twofold increase in the scaling exponent. This difference…
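To make the quantities in the abstract concrete, here is a minimal sketch of how spectral ranks and a scaling exponent could be computed. It rests on assumptions not stated in the text above: the hard rank is taken to be the participation ratio of the covariance eigenspectrum, the soft rank is taken to be the exponential of its Shannon entropy, and the scaling exponent is taken to be the slope of a log-log least-squares fit of rank against FFN width. The function names `spectral_ranks` and `fit_scaling_exponent` are hypothetical and are not taken from the paper.

```python
import numpy as np


def spectral_ranks(X):
    """Return (soft_rank, hard_rank) for a representation matrix X.

    X has shape (n_tokens, d). ASSUMED definitions (not from the source):
      hard rank = participation ratio, (sum l)^2 / sum l^2
      soft rank = exp(Shannon entropy of the normalized eigenvalues)
    """
    X = np.asarray(X, dtype=np.float64)
    # Center the features and form the sample covariance matrix.
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = Xc.T @ Xc / max(X.shape[0] - 1, 1)
    # Eigenvalues of a symmetric matrix; clip tiny negative round-off.
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    p = eig / eig.sum()  # normalized eigenspectrum, sums to 1
    hard_rank = 1.0 / np.sum(p ** 2)
    nz = p[p > 0]  # skip zeros so log() is defined
    soft_rank = float(np.exp(-np.sum(nz * np.log(nz))))
    return soft_rank, float(hard_rank)


def fit_scaling_exponent(widths, ranks):
    """Slope of log(rank) vs. log(width), i.e. beta in rank ~ width**beta.

    ASSUMPTION: a plain least-squares power-law fit; the paper may use
    a different fitting procedure.
    """
    slope, _intercept = np.polyfit(np.log(widths), np.log(ranks), 1)
    return float(slope)


if __name__ == "__main__":
    # Toy data: variance spread evenly over two directions.
    X = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
    print(spectral_ranks(X))

    # Synthetic ranks that follow an exact square-root law in width.
    widths = np.array([256, 512, 1024, 2048])
    print(fit_scaling_exponent(widths, 3.0 * widths ** 0.5))
```

Under these assumed definitions, both ranks equal the number of active directions when variance is spread evenly across them, and both fall toward 1 as the spectrum concentrates on a single direction. An exponent near 1 would then mean that utilized spectral capacity grows in proportion to width, while an exponent well below 1 would mean that much of the added width goes unused.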
