Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning
Yechen Zhang, Shuhao Xing, Junhao Huang, Kai Lv, Yunhua Zhou, Xipeng Qiu, Qipeng Guo, Kai Chen

TL;DR
Mousse is a new optimizer that improves spectral optimization for deep neural networks by adapting updates to the curvature spectrum, needing approximately 12% fewer training steps than Muon.
Contribution
We introduce Mousse, a curvature-aware optimizer that combines spectral methods with second-order preconditioning using Shampoo's structural estimation.
Findings
Mousse reduces training steps by approximately 12% compared to Muon.
Mousse demonstrates improved stability and convergence on language models from 160M to 800M parameters.
Empirical results show negligible additional computational overhead.
Abstract
Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this "egalitarian" constraint is suboptimal for deep neural networks, where the curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose Mousse (Muon Optimization Utilizing Shampoo's Structural Estimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric…
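To make the "uniform spectral update norm" concrete, the following is a minimal sketch of the isotropy the abstract attributes to Muon, not the authors' code and not Mousse itself. It assumes that orthogonalizing a gradient matrix means replacing it with its orthogonal polar factor, and it uses the classical cubic Newton-Schulz iteration as a stand-in for whatever polynomial iteration a production Muon implementation uses. The matrix `G` and the helper names (`matmul`, `transpose`, `orthogonalize`, `gram`) are illustrative assumptions.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate the orthogonal polar factor of g.

    Uses the classical cubic Newton-Schulz iteration
    X <- 1.5 * X - 0.5 * (X X^T) X, which drives every nonzero
    singular value of X toward 1. This is an illustrative stand-in,
    not the exact iteration used by any particular Muon release.
    """
    # Scale by the Frobenius norm so all singular values start in (0, 1].
    fro = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / fro for v in row] for row in g]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_x, row_c)]
             for row_x, row_c in zip(x, xxt_x)]
    return x


def gram(a):
    """Return A A^T; its eigenvalues are the squared singular values of A."""
    return matmul(a, transpose(a))


# Hypothetical ill-conditioned 2x3 "gradient": one direction dominates.
G = [[4.0, 0.0, 1.0],
     [3.0, 0.1, 0.5]]
O = orthogonalize(G)

gram_G = gram(G)  # far from the identity: very unequal singular values
gram_O = gram(O)  # close to the identity: every singular value is about 1

print("G G^T =", gram_G)
print("O O^T =", [[round(v, 6) for v in row] for row in gram_O])
```

Because `O O^T` comes out as (approximately) the identity, the update moves equally far along every singular direction no matter how large or small the original gradient component was. This is the "egalitarian" behavior the abstract argues is a poor match for heavy-tailed curvature.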
