
TL;DR
This paper introduces a spectral Wasserstein framework for analyzing mean-field normalized training dynamics in deep learning, connecting matrix norms with gradient flow interpretations.
Contribution
It develops a unified spectral Wasserstein distance framework, extending classical optimal transport to matrix norms and linking it to mean-field normalized training dynamics.
Findings
Spectral Wasserstein distances interpolate between classical $W_2$ and Muon geometries.
The framework yields a gradient-flow interpretation of normalized training dynamics.
Numerical experiments demonstrate the framework's applicability to various models.
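To make the interpolation in the first finding concrete, the sketch below computes Schatten $p$-norms of a positive semidefinite matrix with NumPy. It is an illustration only, not the paper's code: the helper name `schatten_norm` and the example matrix `S` are assumptions introduced here. The only facts it relies on are standard: the Schatten 1-norm is the trace norm, and the Schatten $\infty$-norm is the operator norm.

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm: the l_p norm of the singular values of A.

    Hypothetical helper for illustration; p >= 1 or p = np.inf.
    """
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return float(s.max())          # operator (spectral) norm
    return float((s ** p).sum() ** (1.0 / p))

# Assumed example: a PSD matrix with eigenvalues 3, 1, 0.
S = np.diag([3.0, 1.0, 0.0])

trace_norm = schatten_norm(S, 1)        # equals trace(S) for PSD S: 4.0
frobenius = schatten_norm(S, 2)         # sqrt(3^2 + 1^2) = sqrt(10)
operator_norm = schatten_norm(S, np.inf)  # largest eigenvalue: 3.0
```

As $p$ grows from 1 to $\infty$ the value decreases from the trace to the largest eigenvalue, which is the sense in which Schatten norms sit between the two endpoint geometries.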
Abstract
Gradient normalization stabilizes deep-learning optimization, and spectral normalizations are especially natural for matrix-shaped parameter blocks; Muon is the motivating example. We study an idealized deterministic, continuous-time, vanishing-momentum version of this idea in the mean-field regime, where wide models are represented by probability measures on parameter space. Starting from normalized matrix flows, we introduce Spectral Wasserstein distances indexed by norms on positive semidefinite matrices: the trace norm gives classical $W_2$, the operator norm gives the Muon geometry, and Schatten norms interpolate between them. We develop the static Kantorovich formulation, a max-min robust-cost representation, Gaussian reductions extending the Bures formula, and, for monotone norms, prove equivalence with a Benamou-Brenier formulation. This yields a gradient-flow…
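For reference, the classical Bures formula that the Gaussian reductions extend gives the squared $W_2$ distance between $\mathcal{N}(m_1, \Sigma_1)$ and $\mathcal{N}(m_2, \Sigma_2)$ as $\|m_1 - m_2\|^2 + \operatorname{tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\big)$. The sketch below evaluates only this classical trace-norm case; it does not implement the paper's spectral generalization, and the function names `psd_sqrt` and `bures_w2_squared` are assumptions made for illustration.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    S = 0.5 * (S + S.T)                 # symmetrize against round-off
    w, V = np.linalg.eigh(S)
    w = np.clip(w, 0.0, None)           # clamp tiny negative eigenvalues
    return (V * np.sqrt(w)) @ V.T

def bures_w2_squared(m1, S1, m2, S2):
    """Squared classical W_2 distance between two Gaussians (Bures formula)."""
    r = psd_sqrt(S1)
    cross = psd_sqrt(r @ S2 @ r)
    mean_term = float(np.sum((np.asarray(m1) - np.asarray(m2)) ** 2))
    cov_term = float(np.trace(S1) + np.trace(S2) - 2.0 * np.trace(cross))
    return mean_term + cov_term

# Assumed example: N(0, I) versus N((1, 0), 4 I) in two dimensions.
# Covariance term: 2 + 8 - 2 * 4 = 2; mean term: 1; total: 3.
d2 = bures_w2_squared(np.zeros(2), np.eye(2), np.array([1.0, 0.0]), 4.0 * np.eye(2))
```

The eigendecomposition-based square root keeps the example dependent on NumPy alone; a general matrix square root routine would work equally well here.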
