Muon in Associative Memory Learning: Training Dynamics and Scaling Laws
Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, Liwei Wang

TL;DR
This paper analyzes Muon, an optimizer that updates matrix parameters using the matrix sign of the gradient, and shows that it converges faster and scales more efficiently than Gradient Descent (GD) in a linear associative memory model.
Contribution
The paper gives a theoretical analysis of Muon in an associative memory model: it characterizes Muon's training dynamics, derives its optimization scaling law, and interprets Muon as an implicit preconditioner. Empirical experiments support the theory.
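As a rough illustration of the preconditioner reading, here is one standard identity for the matrix sign. It is a generic linear-algebra fact, not necessarily the formulation used in the paper, and it assumes the gradient G has full row rank so that G G^T is invertible. The symbols U, Sigma, V (reduced SVD factors of G) are notation introduced here.

```latex
% Assumption: G = U \Sigma V^\top is the reduced SVD and G G^\top is invertible.
\begin{aligned}
G G^\top &= U \Sigma^2 U^\top, \\
(G G^\top)^{-1/2} G &= U \Sigma^{-1} U^\top \, U \Sigma V^\top = U V^\top = \operatorname{msign}(G).
\end{aligned}
```

Under this assumption, a matrix-sign step is a gradient step preconditioned by (G G^T)^{-1/2}, which rescales every singular direction of G to unit magnitude.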
Findings
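In the noiseless setting, Muon achieves an exponential speedup over GD.
GD learns frequency components at highly imbalanced rates and is bottlenecked by low-frequency components; Muon mitigates this imbalance, giving faster and more uniform progress.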
Experimental results validate theoretical predictions on synthetic and pre-training tasks.
Abstract
Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon's optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be…
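The matrix-sign update described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the matrix sign is computed exactly from an SVD, momentum is omitted, and the function names (`matrix_sign`, `gd_step`, `muon_style_step`) and the toy gradient with a hand-picked imbalanced spectrum are all hypothetical. Practical Muon implementations typically approximate the matrix sign iteratively rather than calling an SVD.

```python
import numpy as np

def matrix_sign(G):
    """Return U @ V^T from the reduced SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def gd_step(W, G, lr):
    """Plain gradient descent: step along the raw gradient."""
    return W - lr * G

def muon_style_step(W, G, lr):
    """Matrix-sign step (no momentum; exact SVD is an assumption of this sketch)."""
    return W - lr * matrix_sign(G)

# Toy gradient with a deliberately imbalanced singular-value spectrum,
# standing in loosely for components learned at very different rates.
rng = np.random.default_rng(0)
d = 4
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
V, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix
spectrum = np.array([1.0, 0.1, 0.01, 0.001])
G = U @ np.diag(spectrum) @ V.T

W0 = np.zeros((d, d))
gd_update = W0 - gd_step(W0, G, lr=1.0)            # equals G
muon_update = W0 - muon_style_step(W0, G, lr=1.0)  # equals matrix_sign(G)

# Singular values of each update: GD inherits the imbalance of G,
# while the matrix-sign update moves every direction by the same amount.
gd_sv = np.linalg.svd(gd_update, compute_uv=False)
muon_sv = np.linalg.svd(muon_update, compute_uv=False)
print("GD update singular values:  ", gd_sv)
print("Sign update singular values:", muon_sv)
```

In this toy example the GD update moves the weakest direction a thousand times less than the strongest one, whereas the matrix-sign update treats all directions equally. That is only an intuition for the "more uniform progress" claim; the paper's analysis concerns the full softmax-retrieval model.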
Taxonomy
Topics: Machine Learning in Materials Science · Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques
