Same Architecture, Different Capacity: Optimizer-Induced Spectral Scaling Laws

Nandan Kumar Jha; Brandon Reagen

arXiv:2605.21803·cs.LG·May 22, 2026

TL;DR

This paper shows that the choice of optimizer changes the spectral scaling laws of Transformer models: optimizers differ in how effectively they turn added FFN width into utilized spectral capacity, independent of architecture and loss.

Contribution

The paper shows that optimizer choice changes how the spectral rank of FFN representations scales with width, which makes the case for optimizer-architecture co-design in model development.

Findings

01

AdamW shows weak hard-rank scaling ($\beta = 0.44$) on rare-token (TAIL) representations.

02

Muon achieves linear hard-rank scaling ($\beta = 1.02$) on the same rare-token representations, a $2.3\times$ larger exponent than AdamW's.

03

Optimizer effects on spectral geometry often surpass architectural influences.
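
The scaling exponents behind these findings can be read as slopes on a log-log plot. The sketch below assumes a power law of the form rank ≈ c · width^β and estimates β by ordinary least squares on the logarithms. The function name `fit_scaling_exponent`, the width grid, and the prefactor are hypothetical, and the data are synthetic; this is an illustration of the fitting idea, not the authors' procedure or their measurements.

```python
import math


def fit_scaling_exponent(widths, ranks):
    """Fit rank ~ c * width**beta by least squares in log-log space.

    Returns (beta, c). Assumes all widths and ranks are positive.
    """
    xs = [math.log(w) for w in widths]
    ys = [math.log(r) for r in ranks]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line is cov(x, y) / var(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    prefactor = math.exp(mean_y - beta * mean_x)
    return beta, prefactor


# Synthetic example only: ranks generated from an exact power law,
# so the fit should recover the exponent used to generate them.
widths = [256, 512, 1024, 2048, 4096]   # hypothetical FFN widths
synthetic_ranks = [3.0 * w ** 0.44 for w in widths]

beta, prefactor = fit_scaling_exponent(widths, synthetic_ranks)
print(f"fitted beta = {beta:.3f}, prefactor = {prefactor:.3f}")
```

Under this reading, an exponent near 1 means the measured rank grows in proportion to width, while an exponent well below 1 means each doubling of width yields less than a doubling of utilized rank.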

Abstract

Scaling laws have made language-model performance predictable from model size, data, and compute, but they typically treat the optimizer as a fixed training detail. We show that this assumption misses a fundamental axis of representation scaling: how effectively the optimizer converts added FFN width into utilized spectral capacity. Using eigenspectra of feed-forward network representations, measured through soft and hard spectral ranks, we find that the same Transformer architecture realizes markedly different spectral scaling laws when trained with different optimizers. Holding architecture and width schedule fixed, AdamW exhibits weak hard-rank scaling ($\beta = 0.44$) on rare-token (TAIL) representations, where learning is known to be hardest, whereas Muon achieves linear scaling ($\beta = 1.02$) in the same regimes, a $2.3\times$ increase in the scaling exponent. This difference…
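
To make the notion of a spectral rank concrete, here is a minimal sketch of two common effective-rank measures computed from the eigenspectrum of a representation matrix. The text shown here does not define hard and soft rank, so the definitions below are assumptions: hard rank is taken to be the participation ratio $(\sum_i \lambda_i)^2 / \sum_i \lambda_i^2$, and soft rank the exponential of the Shannon entropy of the normalized eigenvalues. The function names and the random input are hypothetical.

```python
import numpy as np


def eigenspectrum(reps):
    """Eigenvalues of the feature covariance of `reps`, shape (tokens, features)."""
    centered = reps - reps.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (reps.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)      # symmetric matrix, real eigenvalues
    return np.clip(eigvals, 0.0, None)     # remove tiny negative round-off


def hard_rank(eigvals):
    """ASSUMED definition: participation ratio (sum l)^2 / sum l^2."""
    return float(eigvals.sum() ** 2 / np.square(eigvals).sum())


def soft_rank(eigvals):
    """ASSUMED definition: exp of the Shannon entropy of normalized eigenvalues."""
    p = eigvals / eigvals.sum()
    p = p[p > 0]                           # treat 0 * log(0) as 0
    return float(np.exp(-(p * np.log(p)).sum()))


# Synthetic stand-in for FFN activations; not real model data.
rng = np.random.default_rng(0)
reps = rng.standard_normal((1000, 16))
spectrum = eigenspectrum(reps)
print("hard rank:", hard_rank(spectrum))
print("soft rank:", soft_rank(spectrum))
```

With these definitions, a flat spectrum over d directions gives a rank of d under both measures, and a spectrum dominated by a single direction gives a rank close to 1. The participation ratio never exceeds the entropy-based measure, so it is the stricter of the two.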
