A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat

TL;DR
This paper presents a distributed, data-parallel implementation of the Shampoo optimizer in PyTorch, enabling scalable training of neural networks with minimal performance overhead and demonstrating its effectiveness on ImageNet ResNet50.
Contribution
The work introduces a complete, optimized PyTorch implementation of Shampoo that supports multi-GPU distributed training and demonstrates its advantages over standard methods.
Findings
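Per-step time is at most 10% slower than with standard diagonal-scaling adaptive gradient methods.
Shampoo outperforms standard training recipes on ImageNet ResNet50 with minimal hyperparameter tuning.
Distributing each parameter's preconditioner blocks across GPUs makes Shampoo practical for multi-GPU data-parallel training of deep networks.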
Abstract
Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step…
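The abstract describes the preconditioner and the distributed step only in words. A minimal sketch follows, under stated assumptions: it uses the original Shampoo update for a matrix-shaped block (left and right factors accumulated from the gradient, each applied with exponent -1/4) and leaves out the learning rate, momentum, and every other option of the actual implementation. The distributed part is a single-process simulation that calls neither DTensor nor torch.distributed. The function names and the round-robin block assignment are hypothetical, chosen for illustration.

```python
import numpy as np


def inverse_root(mat, root, eps=1e-12):
    """Compute (mat + eps*I)^(-1/root) for a symmetric PSD matrix."""
    evals, evecs = np.linalg.eigh(mat + eps * np.eye(mat.shape[0]))
    evals = np.maximum(evals, eps)  # guard against tiny negative round-off
    return (evecs * evals ** (-1.0 / root)) @ evecs.T


def shampoo_direction(L, R, G, eps=1e-12):
    """Accumulate the two Kronecker factors and precondition one gradient block."""
    L = L + G @ G.T  # left factor, shape (m, m)
    R = R + G.T @ G  # right factor, shape (n, n)
    direction = inverse_root(L, 4, eps) @ G @ inverse_root(R, 4, eps)
    return L, R, direction


def simulated_distributed_step(grad_blocks, states, num_workers):
    """Single-process stand-in for the distributed step.

    Block i is owned by worker i % num_workers (round-robin is an assumption).
    Each simulated worker preconditions only the blocks it owns; the final
    merge plays the role of the AllGather that hands every worker the full
    set of search directions.
    """
    per_worker = [dict() for _ in range(num_workers)]
    for i, G in enumerate(grad_blocks):
        owner = i % num_workers
        L, R = states[i]
        L, R, direction = shampoo_direction(L, R, G)
        states[i] = (L, R)  # factor state stays with the owning worker
        per_worker[owner][i] = direction

    gathered = {}
    for local_directions in per_worker:  # simulated AllGather
        gathered.update(local_directions)
    return [gathered[i] for i in range(len(grad_blocks))]
```

Calling `simulated_distributed_step(blocks, states, num_workers=2)` with a list of 2-D gradient blocks and zero-initialized `(L, R)` pairs returns one search direction per block, in block order. Because both inverse roots are symmetric, flattening the direction row by row gives the same vector as `np.kron(L_root, R_root) @ G.reshape(-1)`, which is the Kronecker product structure the abstract refers to, obtained without ever forming the full (mn) x (mn) matrix.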
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Neural Networks and Applications
Methods: AdaGrad
