Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Ruotong Sun, Ermin Wei

TL;DR
Pro-KLShampoo combines KL-Shampoo's Kronecker-factored preconditioning with orthogonalization, exploiting the spike-and-flat eigenvalue spectrum of KL-Shampoo's preconditioners to make large language model pre-training more efficient.
Contribution
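It shows that the eigenvalue spectra of KL-Shampoo's Kronecker preconditioners have a spike-and-flat shape (a few dominant eigenvalues followed by a nearly uniform tail) and uses this structure to build a hybrid optimizer that pairs a projected form of KL-Shampoo with orthogonalization.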
Findings
Pro-KLShampoo outperforms KL-Shampoo in validation loss across multiple models.
It reduces memory usage and training time compared to standard KL-Shampoo.
The method is effective across different model scales and training stages.
Abstract
Optimizers that exploit the matrix structure of gradients are central to modern LLM pre-training, with two distinct frontiers: explicit Kronecker-factored preconditioning -- most recently KL-Shampoo, which estimates the preconditioner via KL divergence minimization -- and orthogonalization of the gradient momentum, exemplified by Muon and analyzed as steepest descent under the spectral norm. The two routes are typically developed in isolation. We make a structural observation about KL-Shampoo's Kronecker preconditioners: their eigenvalue spectra exhibit a spike-and-flat shape -- a few dominant eigenvalues followed by an approximately uniform tail -- across layers and training stages, holding exactly under a rank- signal-plus-noise gradient model. We exploit this structure by restricting one of KL-Shampoo's Kronecker factors to a parametric family aligned with the…
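The two ingredients named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' algorithm. It assumes a toy gradient model G = S + N, with S a fixed low-rank matrix and N i.i.d. Gaussian noise, and it uses the plain left second moment E[G G^T] as a stand-in for a Kronecker factor (KL-Shampoo's actual estimator differs). The function names, the matrix sizes, and the SVD-based orthogonalization (Muon-style optimizers typically approximate this step iteratively) are all illustrative assumptions.

```python
import numpy as np


def expected_left_second_moment(signal, noise_std):
    """E[G G^T] for G = signal + noise, noise entries i.i.d. N(0, noise_std^2).

    The cross terms vanish in expectation and E[N N^T] = n * noise_std^2 * I,
    so the result is signal @ signal.T plus a multiple of the identity.
    """
    m, n = signal.shape
    return signal @ signal.T + n * noise_std**2 * np.eye(m)


def orthogonalize(momentum):
    """Return U V^T from the thin SVD, i.e. set every singular value to 1."""
    u, _, vt = np.linalg.svd(momentum, full_matrices=False)
    return u @ vt


rng = np.random.default_rng(0)
m, n, rank, noise_std = 8, 16, 2, 0.1  # hypothetical toy sizes

# Assumed rank-2 "signal" part of the gradient.
signal = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))

# Eigenvalues in descending order: `rank` spikes, then a flat tail.
eigvals = np.linalg.eigvalsh(expected_left_second_moment(signal, noise_std))[::-1]
spikes, tail = eigvals[:rank], eigvals[rank:]

# Orthogonalized update direction for one noisy gradient sample.
update = orthogonalize(signal + noise_std * rng.standard_normal((m, n)))
```

Under this toy model the tail eigenvalues all equal n * noise_std^2, which is the "flat" part of the spectrum, while the leading `rank` eigenvalues carry the signal. The orthogonalized update has orthonormal rows, so it treats every direction of the gradient with equal weight.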
