A New Perspective on Shampoo's Preconditioner

Depen Morwani; Itai Shapira; Nikhil Vyas; Eran Malach; Sham Kakade; Lucas Janson

arXiv:2406.17748·cs.LG·June 26, 2024

TL;DR

This paper explores the theoretical foundations of Shampoo's Kronecker product preconditioner, revealing its connection to optimal matrix approximations and demonstrating its effectiveness across datasets and architectures.

Contribution

It provides a novel theoretical link between Shampoo's approximation and the optimal Kronecker product, clarifying misconceptions and analyzing practical efficiency tricks.

Findings

01

Shampoo's approximation is close to the optimal Kronecker product approximation.

02

The square of Shampoo's approximation is equivalent to a single step of power iteration for computing the optimal Kronecker product approximation.

03

Practical efficiency tricks, such as using batch gradients, affect the quality of the Hessian approximation.

Abstract

Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss-Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the *optimal* Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the *square* of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we…
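The power-iteration claim in the abstract can be checked numerically on a toy problem. The sketch below is an illustration, not the authors' code. It assumes synthetic Gaussian gradients, a row-major vec convention, and power iteration started from the identity; the names `rearrange`, `cosine`, `mix_left` and `mix_right` are hypothetical. It builds H as the second-moment matrix of the flattened gradients, applies one power-iteration step to the rearranged matrix, and compares the result with the factors E[G G^T] and E[G^T G] whose Kronecker product is the square of Shampoo's approximation.

```python
import numpy as np


def rearrange(H, m, n):
    """Map an (mn x mn) matrix to (m^2 x n^2) so that kron(A, B) becomes
    the rank-one matrix vec(A) vec(B)^T (row-major vec)."""
    return H.reshape(m, n, m, n).transpose(0, 2, 1, 3).reshape(m * m, n * n)


def cosine(A, B):
    """Cosine similarity under the Frobenius inner product."""
    return float(np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B)))


rng = np.random.default_rng(0)
m, n, num_samples = 4, 3, 200

# Synthetic gradients G_s = A Z_s B (an assumption made only for this demo).
mix_left = rng.standard_normal((m, m))
mix_right = rng.standard_normal((n, n))
grads = mix_left @ rng.standard_normal((num_samples, m, n)) @ mix_right

# H = E[vec(G) vec(G)^T], the matrix being approximated.
flat = grads.reshape(num_samples, m * n)
H = flat.T @ flat / num_samples

# Shampoo's statistics: L = E[G G^T], R = E[G^T G].
L = np.einsum("sij,skj->ik", grads, grads) / num_samples
R = np.einsum("sij,sil->jl", grads, grads) / num_samples

# One power-iteration step on the rearranged matrix, started from identities.
H_hat = rearrange(H, m, n)
L_power = (H_hat @ np.eye(n).reshape(-1)).reshape(m, m)
R_power = (H_hat.T @ np.eye(m).reshape(-1)).reshape(n, n)

# Optimal Kronecker approximation: top singular pair of the rearranged matrix.
U, S, Vt = np.linalg.svd(H_hat)
H_opt = S[0] * np.kron(U[:, 0].reshape(m, m), Vt[0].reshape(n, n))

cos_shampoo_sq = cosine(np.kron(L, R), H)
cos_opt = cosine(H_opt, H)
print("one power step reproduces L and R:",
      np.allclose(L_power, L) and np.allclose(R_power, R))
print("cosine(L kron R, H) =", round(cos_shampoo_sq, 4))
print("cosine(optimal, H)  =", round(cos_opt, 4))
```

The optimal cosine similarity is an upper bound on the one achieved by L ⊗ R. How close the two are on real networks is an empirical question that the paper addresses, and this toy example does not settle it.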

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

The authors theoretically and empirically demonstrate that the square of Shampoo’s approximation of $H$ is exactly equivalent to one round of the power iteration method, aimed at obtaining the optimal Kronecker-factored approximation of the matrix $H$. This finding offers a clear theoretical basis for Shampoo’s preconditioner, revealing that its approximation is more rigorous and methodical than previously understood, with direct ties to the power iteration process. This insight deepens our unde

Weaknesses

Despite the fact that I find the results really interesting, I still have my concerns about the relevance of this article to the conference. Below I will try to explain what I mean (if I am wrong, I am open to discussion): - This paper can hardly be called theoretical, since the authors cite previous work for all lemmas in the paper; - The whole paper is built on one result (see Proposition 1), which seems to be a minor result. I also noticed some typos: - Line 104: Kronecker product ($\ot

Reviewer 02 · Rating 8 · Confidence 3

Strengths

Novel insight interpreting a popular optimization algorithm, with extensions to batches and a real data regime. Insights have already been used to create new optimization algorithms which perform well. A good number of numerical experiments clearly verify the claims made about the similarity with the optimal Kronecker product approximation and the reason for choosing L and R as multiples of the identity. The work is succinct with a clear plotting style.

Weaknesses

Experimental details are contained in the Appendix, and we see that cosine similarity changes over the training steps. An investigation could be performed into whether this is independent of the optimization trajectory (or at least holds when using the Shampoo or Shampoo^2 algorithm).

Reviewer 03 · Rating 8 · Confidence 4

Strengths

**Clarity:** Overall, I found the paper easy and enjoyable to read. The paper is very well-organized, and the authors make all their points clearly. For instance, usually, each theoretical result comes with a plot demonstrating the result empirically. This was very nice, made the presentation more concrete, and helped to immediately validate the theory. **Contribution:** The theoretical insights into the Shampoo preconditioner are valuable and timely, as Shampoo is arguably the most practic

Weaknesses

**Introduction** The first paragraph's discussion of the cost of 2nd-order methods should be conditioned a bit; otherwise, it is a bit misleading. A quadratic space requirement and a cubic computational complexity only arise if you naively try to apply classical techniques like Newton's method to Deep Learning. By leveraging automatic differentiation, we can compute hvps (which is usually all we need) without forming the Hessian at a cost of $\mathcal O(np)$, where $n$ is the number of samples

Videos

A New Perspective on Shampoo's Preconditioner · SlidesLive

Taxonomy

Topics: Vibration Control and Rheological Fluids · Acoustic Wave Phenomena Research

Methods: Softmax · Attention Is All You Need