Muon in Associative Memory Learning: Training Dynamics and Scaling Laws

Binghui Li; Kaifei Wang; Han Zhong; Pinyan Lu; Liwei Wang

arXiv:2602.05725 · cs.LG · February 6, 2026

Open Access

TL;DR

This paper analyzes Muon, an optimizer that updates matrix parameters using the matrix sign of the gradient, and shows that in a linear associative memory model it converges faster and scales more efficiently than Gradient Descent (GD).

Contribution

The paper gives a theoretical analysis of Muon in associative memory learning: it characterizes Muon's training dynamics, derives its optimization scaling law, and interprets Muon as an implicit preconditioner. Empirical experiments support the theory.

Findings

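01 In the noiseless setting, Muon achieves an exponential speedup over Gradient Descent (GD).

02 GD learns frequency components at highly imbalanced rates and is bottlenecked by low-frequency components; Muon mitigates this imbalance, giving faster and more uniform progress.

03 Experiments on synthetic and pre-training tasks validate the theoretical predictions.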

Abstract

Muon updates matrix parameters via the matrix sign of the gradient and has shown strong empirical gains, yet its dynamics and scaling behavior remain unclear in theory. We study Muon in a linear associative memory model with softmax retrieval and a hierarchical frequency spectrum over query-answer pairs, with and without label noise. In this setting, we show that Gradient Descent (GD) learns frequency components at highly imbalanced rates, leading to slow convergence bottlenecked by low-frequency components. In contrast, the Muon optimizer mitigates this imbalance, leading to faster and more uniform progress. Specifically, in the noiseless case, Muon achieves an exponential speedup over GD; in the noisy case with a power-decay frequency spectrum, we derive Muon's optimization scaling law and demonstrate its superior scaling efficiency over GD. Furthermore, we show that Muon can be…
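To make "the matrix sign of the gradient" concrete, here is a minimal NumPy sketch. It assumes the usual definition: for a gradient with reduced SVD G = U diag(s) V^T, the matrix sign is U V^T. The function names `matrix_sign` and `muon_like_step`, the 4×4 toy gradient, and its singular values are illustrative assumptions, not taken from the paper. The sketch also leaves out momentum and computes the SVD exactly, whereas practical Muon implementations typically approximate this step iteratively.

```python
import numpy as np


def matrix_sign(grad: np.ndarray) -> np.ndarray:
    """Return U @ V^T for grad = U @ diag(s) @ V^T (exact, via reduced SVD).

    Assumes grad has full rank, so the result is uniquely defined.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt


def muon_like_step(weight: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    """One momentum-free update using the matrix sign of the gradient.

    Hypothetical helper for illustration; not the paper's exact algorithm.
    """
    return weight - lr * matrix_sign(grad)


rng = np.random.default_rng(0)

# Toy gradient whose singular values span three orders of magnitude,
# standing in for an imbalanced spectrum (values chosen arbitrarily).
u, _ = np.linalg.qr(rng.standard_normal((4, 4)))
v, _ = np.linalg.qr(rng.standard_normal((4, 4)))
spectrum = np.array([1.0, 0.1, 0.01, 0.001])
grad = u @ np.diag(spectrum) @ v.T

# A plain GD update is proportional to grad, so it inherits the imbalance.
gd_update_sv = np.linalg.svd(grad, compute_uv=False)

# The matrix-sign update has every singular value equal to 1.
muon_update_sv = np.linalg.svd(matrix_sign(grad), compute_uv=False)

new_weight = muon_like_step(np.zeros((4, 4)), grad, lr=0.1)

print("GD update singular values:         ", gd_update_sv)
print("Matrix-sign update singular values:", muon_update_sv)
```

In this toy example the GD update moves the weakest direction a thousand times less than the strongest one, while the matrix-sign update moves all directions equally. This is only meant to illustrate the kind of imbalance the abstract describes; it does not reproduce the paper's model or its results.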

Taxonomy

Topics: Machine Learning in Materials Science · Ferroelectric and Negative Capacitance Devices · Stochastic Gradient Optimization Techniques