ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
Samyam Rajbhandari, Olatunji Ruwase, Jeff Rasley, Shaden Smith, Yuxiong He

TL;DR
ZeRO-Infinity introduces a heterogeneous system technology that enables training of extremely large models, up to hundreds of trillions of parameters, on limited GPU resources by leveraging GPU, CPU, and NVMe memory, without model refactoring.
Contribution
The paper presents ZeRO-Infinity, a novel system that significantly extends the scale of models trainable on existing hardware by combining GPU, CPU, and NVMe memory.
Findings
Supports models with hundreds of trillions of parameters.
Sustains over 25 petaflops on 512 GPUs (40% of peak performance).
Enables fine-tuning trillion-parameter models on a single node.
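As a rough sanity check on the throughput finding above, the aggregate figure can be converted into a per-GPU number. This is a minimal sketch: the nominal peak of 125 TFLOPS per GPU is an assumption (a commonly quoted V100 tensor-core figure, not stated in this summary), and the function name is illustrative.

```python
def per_gpu_tflops(total_pflops: float, num_gpus: int) -> float:
    """Convert aggregate petaflops into average teraflops per GPU."""
    return total_pflops * 1000.0 / num_gpus


# Assumption: nominal per-GPU peak, not given in the summary.
ASSUMED_PEAK_TFLOPS = 125.0

achieved = per_gpu_tflops(25.0, 512)
fraction_of_peak = achieved / ASSUMED_PEAK_TFLOPS

print(f"{achieved:.1f} TFLOPS per GPU, {fraction_of_peak:.0%} of assumed peak")
# prints "48.8 TFLOPS per GPU, 39% of assumed peak"
```

Since the reported throughput is "over" 25 petaflops, a result just under 40% at exactly 25 petaflops is consistent with the stated fraction of peak.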
Abstract
In the last three years, the largest dense deep learning models have grown over 1000x to reach hundreds of billions of parameters, while the GPU memory has only grown by 5x (16 GB to 80 GB). Therefore, the growth in model scale has been supported primarily through system innovations that allow large models to fit in the aggregate GPU memory of multiple GPUs. However, we are getting close to the GPU memory wall. It requires 800 NVIDIA V100 GPUs just to fit a trillion-parameter model for training, and such clusters are simply out of reach for most data scientists. In addition, training models at that scale requires complex combinations of parallelism techniques that put a big burden on the data scientists to refactor their model. In this paper we present ZeRO-Infinity, a novel heterogeneous system technology that leverages GPU, CPU, and NVMe memory to allow for unprecedented model scale…
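The 800-GPU figure in the abstract can be related to a back-of-the-envelope estimate. The sketch below assumes mixed-precision Adam training with 16 bytes of model state per parameter and 32 GB of memory per V100. Both numbers are assumptions not given in this abstract, and the helper name is hypothetical. The estimate ignores activations, working buffers, and fragmentation, so it yields only a lower bound.

```python
def min_gpus_for_model_states(num_params: int,
                              bytes_per_param: int,
                              gpu_mem_bytes: int) -> int:
    """Lower bound on GPUs needed to hold model states alone."""
    total_bytes = num_params * bytes_per_param
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bytes // gpu_mem_bytes)


# Assumptions (not from the abstract):
ASSUMED_BYTES_PER_PARAM = 16          # fp16 params + grads, fp32 optimizer states
ASSUMED_GPU_MEM_BYTES = 32 * 10**9    # 32 GB V100

lower_bound = min_gpus_for_model_states(
    10**12, ASSUMED_BYTES_PER_PARAM, ASSUMED_GPU_MEM_BYTES
)
print(lower_bound)  # prints 500
```

Under these assumptions, model states alone would need at least 500 GPUs. The abstract's figure of 800 is higher, which is plausible once activations and other working memory are counted, though the exact accounting is not given here.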
Taxonomy
Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques
Methods: ZeRO-Infinity
