Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory

Runa Eschenhagen; Anna Cai; Tsung-Hsien Lee; Hao-Jun Michael Shi

arXiv:2602.09314·cs.LG·February 11, 2026

Open Access

TL;DR

This paper examines how the spectral-descent-based optimizers Shampoo and Muon relate to each other. It finds that Shampoo is more token-efficient than Muon on language models, that Shampoo's benefits come from applying it to weight matrices, and that its weight-matrix update can be read as an adapted Muon update.

Contribution

It connects Shampoo and Muon by showing that Shampoo's update on weight matrices decomposes into an adapted Muon update, and it demonstrates in language-model experiments that Shampoo achieves higher token efficiency than Muon.

Findings

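01. Shampoo achieves higher token efficiency than Muon on language models, mirroring Adam's advantage over Signum.

02. Shampoo's update applied to weight matrices can be decomposed into an adapted Muon update.

03. Shampoo's benefits can be attributed exclusively to its application to weight matrices, which challenges interpretations that are agnostic to parameter shapes.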

Abstract

Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on language models, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam's advantage over Signum. We show that Shampoo's update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo's benefits can be exclusively attributed to its application to weight matrices, challenging interpretations agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related…
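The reduction mentioned in the abstract can be checked numerically. The sketch below is an illustration only, not code from the paper. It assumes one commonly described idealized setting: preconditioners built from a single gradient, with no accumulation and no epsilon regularization. Under that assumption, a Shampoo-style update with Kronecker factors L = G Gᵀ and R = Gᵀ G and exponents of -1/4 equals the spectral descent direction U Vᵀ, and an Adam-style update g / sqrt(g²) equals sign(g). The matrix `G`, its singular values, and the helper `inverse_root` are hypothetical names chosen for this example.

```python
import numpy as np


def inverse_root(mat, p):
    """Return mat^(-1/p) for a symmetric positive-definite matrix (hypothetical helper)."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    # Scale each eigenvector column by its eigenvalue^(-1/p), then recompose.
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


rng = np.random.default_rng(0)

# Hypothetical full-rank "gradient" of a 4x4 weight matrix, built with
# well-separated singular values so the inverse roots are well conditioned.
Q1, _ = np.linalg.qr(rng.standard_normal((4, 4)))
Q2, _ = np.linalg.qr(rng.standard_normal((4, 4)))
G = Q1 @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q2.T

# Idealized Shampoo-style direction: factors from one gradient only
# (assumption: no accumulation over steps, no epsilon added).
L = G @ G.T
R = G.T @ G
shampoo_dir = inverse_root(L, 4) @ G @ inverse_root(R, 4)

# Spectral descent direction: keep the singular vectors, drop the singular values.
U, _, Vt = np.linalg.svd(G)
spectral_dir = U @ Vt

# Element-wise analogue: idealized Adam-style direction versus sign descent.
adam_dir = G / np.sqrt(G ** 2)
sign_dir = np.sign(G)

shampoo_matches_spectral = np.allclose(shampoo_dir, spectral_dir, atol=1e-6)
adam_matches_sign = np.allclose(adam_dir, sign_dir)
print(shampoo_matches_spectral, adam_matches_sign)
```

Once the factors are accumulated over many stochastic gradients, the two directions generally no longer coincide. That gap is the regime the abstract's comparison of Shampoo and Muon concerns.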

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Computational Physics and Python Applications · Big Data and Digital Economy