A New Perspective on Shampoo's Preconditioner
Depen Morwani, Itai Shapira, Nikhil Vyas, Eran Malach, Sham Kakade, Lucas Janson

TL;DR
This paper explores the theoretical foundations of Shampoo's Kronecker product preconditioner, revealing its connection to optimal matrix approximations and demonstrating its effectiveness across datasets and architectures.
Contribution
The paper provides a novel theoretical link between Shampoo's approximation and the optimal Kronecker product approximation, clarifying a common misconception and analyzing practical efficiency tricks.
Findings
The square of Shampoo's approximation is empirically close to the optimal Kronecker product approximation.
The square of Shampoo's approximation relates to power iteration for matrix approximation.
Practical tricks like using batch gradients improve Hessian approximation quality.
Abstract
Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the optimal Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the square of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we…
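The power-iteration claim can be checked numerically on synthetic data. The sketch below is illustrative only and is not the authors' code: it assumes the matrix being approximated is H = E[g gᵀ] with g the row-major flattening of an m×n gradient G, that Shampoo's factors are L = E[G Gᵀ] and R = E[Gᵀ G], and that the optimal Kronecker product approximation comes from the top singular pair of the Van Loan-Pitsianis rearrangement of H. The names `rearrange`, `sqrtm_psd`, `cosine`, and `kronecker_comparison` are hypothetical helpers introduced here.

```python
import numpy as np


def rearrange(H, m, n):
    """Map H[(i,j),(k,l)] to H_hat[(i,k),(j,l)], shape (m*m, n*n).

    With this layout, ||H - kron(A, B)||_F equals
    ||H_hat - vec(A) vec(B)^T||_F for A (m x m) and B (n x n).
    """
    return H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)


def sqrtm_psd(M):
    """Square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T


def cosine(A, B):
    """Cosine similarity under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))


def kronecker_comparison(grads):
    """grads: array of shape (num_samples, m, n) of per-sample gradients."""
    num, m, n = grads.shape
    flat = grads.reshape(num, m * n)                   # row-major vec(G)
    H = flat.T @ flat / num                            # E[g g^T]
    L = np.einsum("sij,skj->ik", grads, grads) / num   # E[G G^T]
    R = np.einsum("sij,sil->jl", grads, grads) / num   # E[G^T G]

    H_hat = rearrange(H, m, n)

    # One power-iteration step on each side, started from the identity.
    left = (H_hat @ np.eye(n).reshape(-1)).reshape(m, m)
    right = (H_hat.T @ np.eye(m).reshape(-1)).reshape(n, n)

    # Optimal Kronecker factors: top singular pair of H_hat.
    U, S, Vt = np.linalg.svd(H_hat, full_matrices=False)
    A_opt = (S[0] * U[:, 0]).reshape(m, m)
    B_opt = Vt[0].reshape(n, n)

    return {
        "L": L,
        "R": R,
        "left": left,
        "right": right,
        "cos_optimal": cosine(H, np.kron(A_opt, B_opt)),
        "cos_shampoo_squared": cosine(H, np.kron(L, R)),
        "cos_shampoo": cosine(H, np.kron(sqrtm_psd(L), sqrtm_psd(R))),
    }


# Synthetic gradients with uneven row and column scales (an arbitrary choice).
rng = np.random.default_rng(0)
m, n = 4, 5
scales = np.outer(np.linspace(1.0, 3.0, m), np.linspace(0.5, 2.0, n))
grads = rng.normal(size=(200, m, n)) * scales

out = kronecker_comparison(grads)
print("power step recovers L:", np.allclose(out["left"], out["L"]))
print("power step recovers R:", np.allclose(out["right"], out["R"]))
for key in ("cos_optimal", "cos_shampoo_squared", "cos_shampoo"):
    print(key, out[key])
```

Cosine similarity is used because it ignores the overall scale of each approximation, so no normalization constant has to be assumed. How close the three similarities are depends entirely on the synthetic data chosen here and says nothing about the paper's experiments.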
Peer Reviews
Decision: ICLR 2025 Poster
The authors theoretically and empirically demonstrate that the square of Shampoo’s approximation of $H$ is exactly equivalent to one round of the power iteration method, aimed at obtaining the optimal Kronecker-factored approximation of the matrix $H$. This finding offers a clear theoretical basis for Shampoo’s preconditioner, revealing that its approximation is more rigorous and methodical than previously understood, with direct ties to the power iteration process. This insight deepens our unde
Despite the fact that I find the results really interesting, I still have my concerns about the relevance of this article to the conference. Below I will try to explain what I mean (if I am wrong, I am open to discussion): - This paper can hardly be called theoretical, since the authors cite previous work for all lemmas in the paper; - The whole paper is built on one result (see Proposition 1), which seems to be a minor result. I also noticed some typos: - Line 104: Kronecker product ($\ot
Novel insight interpreting a popular optimization algorithm, with extensions to batches and a real data regime. Insights have already been used to create new optimization algorithms which perform well. A good number of numerical experiments clearly verify the claims made about the similarity with the optimal Kronecker product approximation and the reason for choosing L and R as multiples of the identity. The work is succinct with a clear plotting style.
Experimental details are contained in the Appendix. The cosine similarity changes over the training steps; an investigation into whether this behavior is independent of the optimization trajectory (or at least an investigation using the Shampoo or Shampoo^2 algorithm) could be performed.
**Clarity:** Overall, I found the paper easy and enjoyable to read. The paper is very well-organized, and the authors make all their points clearly. For instance, usually, each theoretical result comes with a plot demonstrating the result empirically. This was very nice, made the presentation more concrete, and helped to immediately validate the theory. **Contribution:** The theoretical insights into the Shampoo preconditioner are valuable and timely, as Shampoo is arguably the most practic
**Introduction** The first paragraph's discussion of the cost of 2nd-order methods should be conditioned a bit; otherwise, it is a bit misleading. A quadratic space requirement and a cubic computational complexity only arise if you naively try to apply classical techniques like Newton's method to Deep Learning. By leveraging automatic differentiation, we can compute hvps (which is usually all we need) without forming the Hessian at a cost of $\mathcal O(np)$, where $n$ is the number of samples
