Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization

Ruotong Sun; Ermin Wei

arXiv:2605.06316·cs.LG·May 8, 2026

TL;DR

Pro-KLShampoo combines KL-Shampoo's Kronecker-factored preconditioning with orthogonalization of the gradient momentum, exploiting the spike-and-flat eigenvalue spectrum of KL-Shampoo's preconditioners to make large language model pre-training more efficient.

Contribution

The paper shows that the eigenvalue spectra of KL-Shampoo's Kronecker preconditioners have a spike-and-flat shape (a few dominant eigenvalues followed by an approximately uniform tail) and uses this structure to build a hybrid optimizer.

Findings

01. Pro-KLShampoo outperforms KL-Shampoo in validation loss across multiple models.

02. It reduces memory usage and training time compared to standard KL-Shampoo.

03. The method is effective across different model scales and training stages.

Abstract

Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization -- and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo's Kronecker preconditioners: their eigenvalue spectra exhibit a spike-and-flat shape -- a few dominant eigenvalues followed by an approximately uniform tail -- across layers and training stages, holding exactly under a rank-$ρ$ signal-plus-noise gradient model. We exploit this structure by restricting one of KL-Shampoo's Kronecker factors to a parametric family aligned with the…
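The two ingredients named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation and does not reproduce KL-Shampoo's estimator. As a simplifying assumption, it looks at the plain row second moment E[G Gᵀ]/n of a gradient G = U diag(s) Vᵀ + σE with i.i.d. Gaussian noise E, whose population eigenvalues are s_i²/n + σ² for the ρ signal directions and σ² for the rest, i.e. a spike-and-flat spectrum. The orthogonalization step uses an exact SVD; Muon-style optimizers typically approximate this map with a Newton-Schulz iteration. The function names, shapes, and signal strengths are hypothetical choices made for this example.

```python
import numpy as np


def spike_and_flat_spectrum(m=64, n=256, rho=3, sigma=0.1, num_samples=500, seed=0):
    """Eigenvalues (descending) of the empirical E[G G^T] / n.

    Assumed toy model: G = U diag(s) V^T + sigma * E, where E has i.i.d.
    standard normal entries. This is a plain second moment, not
    KL-Shampoo's Kronecker factor.
    """
    rng = np.random.default_rng(seed)
    # Fixed orthonormal signal directions (illustrative choice).
    U, _ = np.linalg.qr(rng.standard_normal((m, rho)))
    V, _ = np.linalg.qr(rng.standard_normal((n, rho)))
    # Signal strengths scaled so that s_i^2 / n runs from 9 down to 1.
    s = np.linspace(3.0, 1.0, rho) * np.sqrt(n)
    signal = (U * s) @ V.T

    second_moment = np.zeros((m, m))
    for _ in range(num_samples):
        G = signal + sigma * rng.standard_normal((m, n))
        second_moment += G @ G.T
    second_moment /= num_samples * n

    # eigvalsh returns ascending eigenvalues; reverse to get the spikes first.
    return np.linalg.eigvalsh(second_moment)[::-1]


def orthogonalize(M):
    """Map M to U V^T from its thin SVD, setting every singular value to 1.

    Exact-SVD stand-in for the orthogonalization used by Muon-style updates.
    """
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt
```

Calling `spike_and_flat_spectrum()` and inspecting the first `rho` entries against the remainder shows the gap between the dominant eigenvalues and the nearly uniform tail; `orthogonalize(G)` returns a matrix whose singular values are all 1, which is the sense in which orthogonalization whitens an update.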
