Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, Jason D. Lee

TL;DR
This paper analyzes the storage capacity and efficiency of spectral optimizers such as Muon in learning associative memory, showing that one step of Muon recovers more associations than SGD and matches Newton's method while using only first-order information.
Contribution
It sharply characterizes the one-step recovery rates of Muon, SGD, and Newton's method on the logistic regression loss in a tractable linear associative memory model with Gaussian embeddings and power-law frequencies.
Findings
Muon's storage capacity exceeds that of SGD and matches that of Newton's method while using only first-order information.
Muon saturates at a larger critical batch size, enabling better scaling.
Experimental results validate the theoretical scaling laws and recovery rates.
Abstract
Spectral optimizers such as Muon have recently shown strong empirical performance in large-scale language model training, but the source and extent of their advantage remain poorly understood. We study this question through the linear associative memory problem, a tractable model for factual recall in transformer-based models. In particular, we go beyond orthogonal embeddings and consider Gaussian inputs and outputs, which allows the number of stored associations to greatly exceed the embedding dimension. Our main result sharply characterizes the recovery rates of one step of Muon, SGD, and Newton's method on the logistic regression loss under a power law frequency distribution. We show that the storage capacity of Muon significantly exceeds that of SGD, and even matches Newton's method while only using first-order information. Moreover, Muon saturates at a larger critical batch size.…
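The setup described in the abstract can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes a zero-initialized weight matrix, a full-batch (population) gradient, an idealized Muon step computed as the exact polar factor of the gradient via SVD (practical Muon adds momentum and approximates this with Newton-Schulz iterations), and argmax decoding as the recovery criterion. The names `power_law`, `gradient_at_zero`, `polar_factor`, and `recovered`, as well as the values of `N`, `d`, and `alpha`, are hypothetical choices made for this example.

```python
import numpy as np

def power_law(n, alpha):
    """Frequencies p_i proportional to i^(-alpha), normalized to sum to 1."""
    w = np.arange(1, n + 1, dtype=float) ** (-alpha)
    return w / w.sum()

def gradient_at_zero(E, U, p):
    """Population cross-entropy gradient at W = 0.

    With logits u_j^T W e_i, the softmax is uniform at W = 0, so the
    gradient is -sum_i p_i (u_i - u_bar) e_i^T, with u_bar the mean output.
    """
    u_bar = U.mean(axis=0)
    return -((U - u_bar) * p[:, None]).T @ E

def polar_factor(G):
    """Orthogonal polar factor A B^T of G = A diag(s) B^T (assumed idealized Muon direction)."""
    A, _, Bt = np.linalg.svd(G, full_matrices=False)
    return A @ Bt

def recovered(W, E, U):
    """Boolean mask: item i counts as recovered if argmax_j u_j^T W e_i == i."""
    scores = E @ W.T @ U.T  # scores[i, j] = u_j^T W e_i
    return np.argmax(scores, axis=1) == np.arange(len(E))

rng = np.random.default_rng(0)
N, d, alpha = 512, 32, 1.0          # more associations than dimensions (N > d)
E = rng.standard_normal((N, d))     # Gaussian input embeddings e_i
U = rng.standard_normal((N, d))     # Gaussian output embeddings u_i
p = power_law(N, alpha)

G = gradient_at_zero(E, U, p)
O = polar_factor(G)

# One step from W = 0; the step size does not affect argmax decoding.
W_sgd = -G
W_muon = -O

n_sgd = int(recovered(W_sgd, E, U).sum())
n_muon = int(recovered(W_muon, E, U).sum())
print(f"recovered after one step: SGD={n_sgd}, Muon={n_muon} (of N={N})")
```

The SGD step weights each association by its frequency, while the polar factor sets every singular value of the update to one. Comparing `n_sgd` and `n_muon` across values of `N`, `d`, and `alpha` gives a rough empirical view of the capacity gap the abstract describes, under the assumptions listed above.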
