Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory

Juno Kim; Eshaan Nichani; Denny Wu; Alberto Bietti; Jason D. Lee

arXiv:2603.26554 · cs.LG · April 29, 2026

TL;DR

This paper analyzes how well spectral optimizers such as Muon learn a linear associative memory. It shows that one step of Muon stores significantly more associations than one step of SGD, matches Newton's method while using only first-order information, and saturates at a larger critical batch size.

Contribution

It sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss in a tractable linear associative memory model with Gaussian inputs and outputs and a power-law frequency distribution.

Findings

1. Muon exceeds SGD in storage capacity and matches Newton's method while using only first-order information.

2. Muon saturates at a larger critical batch size, so its gains persist to larger batches.

3. Experimental results validate the theoretical scaling laws and recovery rates.

Abstract

Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and even matches Newton's method while only using first-order information. Moreover, Muon saturates at a larger critical batch size.…
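The setup described in the abstract can be illustrated with a small NumPy sketch. This is a toy reading of the problem, not the paper's experimental code: the embedding normalization, the default sizes, the exponent `alpha`, and the helper names `polar_factor` and `one_step_recovery` are all assumptions made for illustration. It takes one full-batch step from W = 0 on the multiclass logistic (cross-entropy) loss and counts an association as recovered when the correct output has the largest logit. The spectral update here uses an exact SVD polar factor, whereas practical Muon implementations approximate this orthogonalization iteratively. Newton's method is omitted.

```python
import numpy as np

def polar_factor(G):
    """Orthogonalize G by setting all of its singular values to 1 (exact SVD)."""
    left, _, right_t = np.linalg.svd(G, full_matrices=False)
    return left @ right_t

def one_step_recovery(N=2000, d=64, alpha=1.0, seed=0):
    """Return boolean recovery masks (sgd, spectral) after one step from W = 0.

    Assumed toy model: N > d associations (u_i -> v_i) with Gaussian
    embeddings and frequencies p_i proportional to i^(-alpha).
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((N, d)) / np.sqrt(d)  # rows are inputs u_i
    V = rng.standard_normal((N, d)) / np.sqrt(d)  # rows are outputs v_i
    p = np.arange(1, N + 1, dtype=float) ** (-alpha)
    p /= p.sum()

    # At W = 0 the softmax is uniform, so the negative population gradient
    # of the cross-entropy loss is sum_i p_i (v_i - mean_j v_j) u_i^T.
    neg_grad = ((V - V.mean(axis=0)) * p[:, None]).T @ U  # shape (d, d)

    # The argmax below is invariant to a positive step size, so it is omitted.
    W_sgd = neg_grad
    W_spectral = polar_factor(neg_grad)

    def recovered(W):
        logits = U @ W.T @ V.T  # entry (i, j) equals v_j^T W u_i
        return np.argmax(logits, axis=1) == np.arange(N)

    return recovered(W_sgd), recovered(W_spectral)

if __name__ == "__main__":
    sgd_mask, spectral_mask = one_step_recovery()
    print("SGD recovered fraction:     ", sgd_mask.mean())
    print("Spectral recovered fraction:", spectral_mask.mean())
```

Varying `N`, `d`, and `alpha` in this sketch gives a rough feel for how the number of recovered associations changes with dimension under each update, but it is not a substitute for the scaling laws derived in the paper.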
