Shampoo: Preconditioned Stochastic Tensor Optimization
Vineet Gupta, Tomer Koren, Yoram Singer

TL;DR
Shampoo is a structure-aware preconditioning algorithm for stochastic optimization over tensor spaces. It converges faster than commonly used optimizers on deep learning models while keeping its per-step runtime comparable to theirs.
Contribution
We introduce Shampoo, a structure-aware preconditioning method for optimization over tensor spaces, with convergence guarantees in the stochastic convex setting and a per-step cost that is practical for deep learning.
Findings
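In the paper's experiments with state-of-the-art deep learning models, Shampoo converges considerably faster than commonly used optimizers.
Despite its more complex update rule, Shampoo's runtime per step is comparable to that of SGD, AdaGrad, and Adam.
Convergence guarantees are established for the stochastic convex setting, with a proof that builds on matrix trace inequalities.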
Abstract
Preconditioned gradient methods are among the most general and powerful tools in optimization. However, preconditioning requires storing and manipulating prohibitively large matrices. We describe and analyze a new structure-aware preconditioning algorithm, called Shampoo, for stochastic optimization over tensor spaces. Shampoo maintains a set of preconditioning matrices, each of which operates on a single dimension, contracting over the remaining dimensions. We establish convergence guarantees in the stochastic convex setting, the proof of which builds upon matrix trace inequalities. Our experiments with state-of-the-art deep learning models show that Shampoo is capable of converging considerably faster than commonly used optimizers. Although it involves a more complex update rule, Shampoo's runtime per step is comparable to that of simple gradient methods such as SGD, AdaGrad, and Adam.
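The per-dimension preconditioning described in the abstract can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names (`inv_root`, `shampoo_step`, `run_demo`), the toy quadratic objective, and the learning rate and epsilon values are all assumptions made for the example. The exponent -1/(2k) for an order-k tensor and the `epsilon * I` initialization follow the update rule given in the full paper, which this summary does not spell out.

```python
import numpy as np


def inv_root(mat, p):
    """Return mat^(-1/p) for a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return (eigvecs * eigvals ** (-1.0 / p)) @ eigvecs.T


def shampoo_step(W, G, H, lr):
    """One Shampoo-style update for an order-k tensor W with gradient G.

    H is a list of k preconditioner matrices, one per dimension of W.
    (Hypothetical helper written for this sketch.)
    """
    k = G.ndim
    precond = G
    new_H = []
    for i in range(k):
        # Unfold G along dimension i: rows index dimension i,
        # columns run over all remaining dimensions.
        G_i = np.moveaxis(G, i, 0).reshape(G.shape[i], -1)
        # Accumulate second-moment statistics for dimension i by
        # contracting over the remaining dimensions.
        H_i = H[i] + G_i @ G_i.T
        new_H.append(H_i)
        # Apply H_i^(-1/(2k)) along dimension i of the running result.
        P_i = inv_root(H_i, 2 * k)
        precond = np.moveaxis(np.tensordot(P_i, precond, axes=(1, i)), 0, i)
    return W - lr * precond, new_H


def run_demo(steps=200, lr=0.5, eps=1e-4, seed=0):
    """Minimize 0.5 * ||W - target||^2 on a toy 5x3 problem (assumed setup)."""
    rng = np.random.default_rng(seed)
    target = rng.standard_normal((5, 3))
    W = np.zeros_like(target)
    # One preconditioner per dimension, initialized to eps * I.
    H = [eps * np.eye(n) for n in W.shape]
    initial_loss = 0.5 * np.sum((W - target) ** 2)
    for _ in range(steps):
        G = W - target  # exact gradient of the toy objective
        W, H = shampoo_step(W, G, H, lr)
    final_loss = 0.5 * np.sum((W - target) ** 2)
    return initial_loss, final_loss


if __name__ == "__main__":
    print(run_demo())
```

For a matrix parameter (k = 2) the loop reduces to the two-sided update W ← W − lr · L^(-1/4) G R^(-1/4), where L accumulates G Gᵀ and R accumulates Gᵀ G. Storage is one n_i × n_i matrix per dimension rather than a single preconditioner over all entries of W, which is the saving the abstract refers to.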
Taxonomy
Topics: Stochastic Gradient Optimization Techniques · Tensor decomposition and applications · Sparse and Compressive Sensing Techniques
Methods: Stochastic Gradient Descent · Adam · AdaGrad
