A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale

Hao-Jun Michael Shi; Tsung-Hsien Lee; Shintaro Iwasaki; Jose Gallego-Posada; Zhijing Li; Kaushik Rangadurai; Dheevatsa Mudigere; Michael Rabbat

arXiv:2309.06497 · cs.LG · September 14, 2023

TL;DR

This paper presents a distributed data-parallel PyTorch implementation of the Shampoo optimizer that trains neural networks at scale with at most a 10% increase in per-step time, and validates it on ImageNet ResNet50.

Contribution

The work provides a complete description of the Shampoo algorithm together with an optimized PyTorch implementation that supports multi-GPU distributed data-parallel training, and compares it against diagonal methods.
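As a rough picture of what multi-GPU support involves here, the sketch below simulates, inside a single process, the pattern the abstract describes: preconditioner work for parameter blocks is split across workers, each worker computes search directions only for its own blocks, and an AllGather gives every worker the full set. This is a sketch under stated assumptions, not the paper's implementation. The helper names (`assign_blocks`, `local_directions`, `all_gather`) are hypothetical, the greedy largest-first assignment is an assumed load-balancing rule, and plain negation stands in for the real preconditioned direction. Neither DTensor nor `torch.distributed` is used.

```python
from typing import Dict, List


def assign_blocks(block_costs: List[int], num_workers: int) -> List[int]:
    """Assumed rule: largest block first, always to the least-loaded worker."""
    loads = [0] * num_workers
    owner = [0] * len(block_costs)
    for idx in sorted(range(len(block_costs)), key=lambda i: -block_costs[i]):
        worker = loads.index(min(loads))  # ties go to the lowest rank
        owner[idx] = worker
        loads[worker] += block_costs[idx]
    return owner


def local_directions(
    rank: int, owner: List[int], grads: List[List[float]]
) -> Dict[int, List[float]]:
    """Each worker computes directions only for the blocks it owns.

    Negating the gradient is a placeholder for the preconditioned direction.
    """
    return {
        i: [-g for g in grads[i]]
        for i, block_owner in enumerate(owner)
        if block_owner == rank
    }


def all_gather(per_worker: List[Dict[int, List[float]]]) -> List[Dict[int, List[float]]]:
    """Simulated AllGather: every worker ends up with every block's direction."""
    merged: Dict[int, List[float]] = {}
    for part in per_worker:
        merged.update(part)
    return [dict(merged) for _ in per_worker]


# Four blocks with illustrative costs, split across two simulated workers.
block_costs = [8, 3, 5, 2]
grads = [[1.0], [2.0], [3.0], [4.0]]
owner = assign_blocks(block_costs, num_workers=2)
parts = [local_directions(rank, owner, grads) for rank in range(2)]
gathered = all_gather(parts)
```

After the gather, each simulated worker holds directions for all four blocks even though it computed only two of them, which is the memory and compute saving the distribution is meant to provide.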

Findings

01

Per-step wall-clock time is at most 10% slower than with diagonal-scaling-based adaptive gradient methods.

02

Demonstrates Shampoo's superiority on ImageNet ResNet50 with minimal hyperparameter tuning.

03

Enables multi-GPU distributed data-parallel training of deep networks at scale.

Abstract

Shampoo is an online and stochastic optimization algorithm belonging to the AdaGrad family of methods for training neural networks. It constructs a block-diagonal preconditioner where each block consists of a coarse Kronecker product approximation to full-matrix AdaGrad for each parameter of the neural network. In this work, we provide a complete description of the algorithm as well as the performance optimizations that our implementation leverages to train deep networks at-scale in PyTorch. Our implementation enables fast multi-GPU distributed data-parallel training by distributing the memory and computation associated with blocks of each parameter via PyTorch's DTensor data structure and performing an AllGather primitive on the computed search directions at each iteration. This major performance enhancement enables us to achieve at most a 10% performance reduction in per-step…
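To make the Kronecker product approximation concrete, here is a minimal NumPy sketch of one Shampoo step for a single matrix-shaped parameter. It follows the original Shampoo formulation for order-2 tensors (left and right factor matrices, each applied with an inverse fourth root), not this implementation, which adds features that are not shown here. The function names, the `floor` safeguard, the learning rate, and the initialization of the factors to a small multiple of the identity are illustrative assumptions.

```python
import numpy as np


def inverse_root(mat: np.ndarray, root: int, floor: float = 1e-12) -> np.ndarray:
    """Return mat^(-1/root) for a symmetric positive definite matrix."""
    evals, evecs = np.linalg.eigh(mat)
    evals = np.maximum(evals, floor)  # guard against rounding below zero
    return (evecs * evals ** (-1.0 / root)) @ evecs.T


def shampoo_step(W, G, L, R, lr=0.1):
    """One update for a matrix parameter W with gradient G.

    L and R accumulate G G^T and G^T G. Their Kronecker product stands in
    for the full-matrix AdaGrad statistics of this parameter.
    """
    L = L + G @ G.T
    R = R + G.T @ G
    direction = inverse_root(L, 4) @ G @ inverse_root(R, 4)
    return W - lr * direction, L, R


# Illustrative 3x2 parameter; factors start at eps * I (assumed eps).
eps = 1e-8
W = np.zeros((3, 2))
G = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
L = eps * np.eye(3)
R = eps * np.eye(2)
W_new, L, R = shampoo_step(W, G, L, R, lr=1.0)
```

For this 3x2 parameter the two factors hold 9 + 4 entries, whereas a full-matrix preconditioner over the 6 flattened entries would hold 36. The gap widens quickly with parameter size, which is the reason for using the coarser approximation.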

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Neural Networks and Applications

Methods: AdaGrad