GSPMD: General and Scalable Parallelization for ML Computation Graphs
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, Ruoming Pang, Noam Shazeer, Shibo Wang, Tao Wang, Yonghui Wu, Zhifeng Chen

TL;DR
GSPMD is an automatic, compiler-based system that enables scalable parallelization of machine learning models, achieving high compute utilization on large-scale hardware with minimal user annotations.
Contribution
GSPMD introduces a flexible partitioning representation and an inference method that together simplify scaling single-device ML programs to large distributed systems.
Findings
Achieves 50% to 62% compute utilization on up to 2048 Cloud TPUv3 cores.
Supports models with up to one trillion parameters.
Infers the partitioning of every operator from a few user annotations on how tensors are distributed.
Abstract
We present GSPMD, an automatic, compiler-based parallelization system for common machine learning computations. It allows users to write programs in the same way as for a single device, then give hints through a few annotations on how to distribute tensors, based on which GSPMD will parallelize the computation. Its representation of partitioning is simple yet general, allowing it to express different or mixed paradigms of parallelism on a wide variety of models. GSPMD infers the partitioning for every operator based on limited user annotations, making it convenient to scale existing single-device programs. It solves several technical challenges for production usage, allowing GSPMD to achieve 50% to 62% compute utilization on up to 2048 Cloud TPUv3 cores for models with up to one trillion parameters.
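The idea of writing a single-device program and then distributing its tensors can be illustrated with a toy sketch. This is not GSPMD's implementation or API: plain Python lists stand in for tensors, a list of shards stands in for devices, and the names `shard_rows` and `partitioned_matmul` are hypothetical. It assumes the simplest case, where one operand is split along its row dimension and the other is replicated, so every simulated device runs the same matrix multiply on its own shard and no cross-device communication is needed.

```python
# Toy illustration only: plain Python lists stand in for tensors and devices.
# None of these names come from GSPMD; they are hypothetical helpers.


def matmul(a, b):
    """Single-device reference: multiply an (m x k) matrix by a (k x n) matrix."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def shard_rows(a, num_devices):
    """Stand-in for a sharding annotation: split the rows of `a` evenly."""
    if len(a) % num_devices != 0:
        raise ValueError("row count must be divisible by num_devices")
    step = len(a) // num_devices
    return [a[i * step:(i + 1) * step] for i in range(num_devices)]


def partitioned_matmul(a, b, num_devices):
    """Run the same program on every row shard of `a`, with `b` replicated.

    Concatenating the per-device outputs reproduces the full result.
    """
    shards = shard_rows(a, num_devices)
    local_results = [matmul(shard, b) for shard in shards]  # same program, different data
    return [row for local in local_results for row in local]


a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0, 2], [0, 1, 3]]

full = matmul(a, b)                                  # single-device result
sharded = partitioned_matmul(a, b, num_devices=2)    # two simulated devices
print(full == sharded)
```

Splitting along the contracted dimension instead would leave each device with a partial sum, and the results would have to be added across devices; deciding such layouts for every operator from a few annotations is the part the abstract attributes to GSPMD.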
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Advanced Data Storage Technologies
