What Really Matters in Matrix-Whitening Optimizers?
Kevin Frans, Pieter Abbeel, Sergey Levine

TL;DR
This paper systematically analyzes matrix-whitening optimizers, revealing that variance adaptation, rather than spectral normalization, is the key factor behind their superior performance over elementwise methods like Adam.
Contribution
The paper identifies variance adaptation as the component of matrix-whitening optimizers that explains their performance advantage, challenging the prevailing focus on spectral normalization.
Findings
Variance adaptation consistently improves optimizer performance.
Spectral normalization alone does not account for performance gains.
Low-rank variance estimators reduce memory costs without sacrificing accuracy.
Abstract
A range of recent optimizers have emerged that approximate the same "matrix-whitening" transformation in various ways. In this work, we systematically deconstruct such optimizers, aiming to disentangle the key components that explain performance. With hyperparameters tuned across the board, all flavors of matrix-whitening methods reliably outperform elementwise counterparts, such as Adam. Matrix-whitening is often related to spectral descent -- however, experiments reveal that performance gains are *not explained solely by accurate spectral normalization* -- particularly, SOAP displays the largest per-step gain, even though Muon more accurately descends along the steepest spectral descent direction. Instead, we argue that matrix-whitening serves two purposes, and the variance adaptation component of matrix-whitening is the overlooked ingredient explaining this performance gap.…
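To make the two ingredients named in the abstract concrete, here is a toy NumPy sketch of each operation in its most generic form. It is an illustration under stated assumptions, not the paper's implementation: the function names `spectral_normalize` and `variance_adapt` are hypothetical, spectral normalization is done with an exact SVD (practical optimizers typically approximate it), and variance adaptation is shown as a plain Adam-style elementwise rescaling.

```python
import numpy as np

def spectral_normalize(g):
    """Set every singular value of g to 1, i.e. return U @ V^T.

    Hypothetical helper: uses an exact SVD for clarity only.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt

def variance_adapt(m, v, eps=1e-8):
    """Adam-style elementwise rescaling (hypothetical helper).

    m: running first-moment estimate of the gradient
    v: running second-moment estimate of the gradient
    """
    return m / (np.sqrt(v) + eps)

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 3))  # stand-in for a gradient matrix

# After spectral normalization, all singular values are (numerically) 1.
o = spectral_normalize(g)
singular_values = np.linalg.svd(o, compute_uv=False)

# With single-sample moment estimates (m = g, v = g**2), the elementwise
# rescaling reduces to roughly the sign of each gradient entry.
a = variance_adapt(g, g ** 2)
```

The first operation acts on the matrix as a whole through its singular values, while the second rescales each entry by an estimate of its own second moment. The abstract's argument is that matrix-whitening combines both purposes, and that the second is the overlooked one.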
