Shampoo: Preconditioned Stochastic Tensor Optimization

Vineet Gupta; Tomer Koren; Yoram Singer

arXiv:1802.09568·cs.LG·March 5, 2018·33 cites

TL;DR

Shampoo is a structure-aware preconditioning algorithm for stochastic optimization over tensor spaces. In the paper's deep learning experiments it converges faster than standard optimizers while keeping a comparable runtime per step.

Contribution

We introduce Shampoo, a structure-aware preconditioning method for tensor spaces with proven convergence guarantees and practical efficiency in deep learning.
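To make the efficiency point concrete, the following minimal sketch counts how many entries a preconditioner would need to store for a parameter tensor of a given shape. It compares a single full matrix over all flattened coordinates with one square matrix per tensor dimension, which is how the abstract describes Shampoo's preconditioners. The function name `preconditioner_entries` and the example shape are hypothetical and chosen only for illustration.

```python
from math import prod


def preconditioner_entries(shape):
    """Count stored entries for two preconditioning schemes (illustrative only).

    full:    one square matrix over all prod(shape) flattened coordinates.
    per_dim: one square matrix per tensor dimension, of size shape[i] x shape[i].
    """
    n = prod(shape)
    full = n * n
    per_dim = sum(d * d for d in shape)
    return full, per_dim


# Hypothetical example: a 512 x 256 weight matrix.
full, per_dim = preconditioner_entries((512, 256))
print(f"full-matrix entries:   {full}")
print(f"per-dimension entries: {per_dim}")
```

For this shape the per-dimension scheme stores 512² + 256² entries, against (512 · 256)² for the full matrix, which illustrates why the full approach is described as prohibitive.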

Findings

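01. In the paper's experiments with deep learning models, Shampoo can converge considerably faster than SGD, AdaGrad, and Adam.

02. Despite its more complex update rule, Shampoo's runtime per step is comparable to that of SGD, AdaGrad, and Adam.

03. Convergence guarantees are established in the stochastic convex setting, with a proof that builds on matrix trace inequalities.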

Abstract

Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
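The abstract does not spell out the update rule, so the following is only a sketch of the matrix (order-2 tensor) case under a stated assumption: that the left and right statistics accumulate `G @ G.T` and `G.T @ G`, and that each is applied to the gradient raised to the power -1/4. This should be checked against the paper before being relied on. The helper names `inv_root` and `shampoo_step`, the toy quadratic objective, the learning rate, and the initial scale `eps` are all hypothetical choices for illustration.

```python
import numpy as np


def inv_root(mat, p):
    """Return mat^(-1/p) for a symmetric positive-definite matrix."""
    w, v = np.linalg.eigh(mat)          # eigenvalues w, orthonormal eigenvectors v
    return (v * w ** (-1.0 / p)) @ v.T  # v @ diag(w^(-1/p)) @ v.T


def shampoo_step(W, G, L, R, lr=0.1):
    """One assumed matrix-case step with one preconditioner per dimension."""
    L = L + G @ G.T   # (m, m): contracts the gradient over the column dimension
    R = R + G.T @ G   # (n, n): contracts the gradient over the row dimension
    W = W - lr * inv_root(L, 4) @ G @ inv_root(R, 4)
    return W, L, R


# Toy problem (illustrative): minimize 0.5 * ||W - T||_F^2, whose gradient is W - T.
rng = np.random.default_rng(0)
m, n, eps = 4, 3, 1e-4
T = rng.standard_normal((m, n))
W = np.zeros((m, n))
L = eps * np.eye(m)   # small multiple of the identity keeps L positive definite
R = eps * np.eye(n)

initial_loss = 0.5 * np.sum((W - T) ** 2)
for _ in range(100):
    G = W - T
    W, L, R = shampoo_step(W, G, L, R, lr=0.1)
final_loss = 0.5 * np.sum((W - T) ** 2)

print(f"initial loss: {initial_loss:.4f}")
print(f"final loss:   {final_loss:.4f}")
```

The two statistics have sizes m × m and n × n, so the cost of storing them scales with the individual dimensions rather than with their product. The eigendecomposition used here for the inverse root is a simple choice for a sketch, not necessarily what a practical implementation would use.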

Taxonomy

Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Sparse and Compressive Sensing Techniques

Methods: Stochastic Gradient Descent · Adam · AdaGrad