ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, Yuxiong He

TL;DR
ZeRO-Offload enables training billion-parameter models on a single GPU by offloading data and compute to CPU, significantly reducing hardware requirements and making large-scale model training accessible to more researchers.
Contribution
ZeRO-Offload introduces an offloading technique that moves data and compute to the CPU, so that models with over 13 billion parameters can be trained on a single GPU without model changes or loss of computational efficiency.
Findings
Supports training models over 13 billion parameters on a single GPU.
Achieves near-linear scaling on up to 128 GPUs.
Enables training models over 70 billion parameters on a single DGX-2.
Abstract
Large-scale model training has been a playground for a limited few, requiring complex model refactoring and access to prohibitively expensive GPU clusters. ZeRO-Offload changes the large model training landscape by making large model training accessible to nearly everyone. It can train models with over 13 billion parameters on a single GPU, a 10x increase in size compared to popular frameworks such as PyTorch, and it does so without requiring any model change from the data scientists or sacrificing computational efficiency. ZeRO-Offload enables large model training by offloading data and compute to the CPU. To preserve compute efficiency, it is designed to minimize data movement to and from the GPU and to reduce CPU compute time while maximizing memory savings on the GPU. As a result, ZeRO-Offload can achieve 40 TFlops/GPU on a single NVIDIA V100 GPU for a 10B-parameter model compared to 30TF using…
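To make the memory argument concrete, the sketch below estimates where model states would live with and without offloading. The byte counts and the GPU/CPU placement are assumptions for illustration, not figures from the abstract. It assumes mixed-precision Adam at roughly 16 bytes per parameter: 2 for fp16 weights, 2 for fp16 gradients, and 12 for fp32 master weights, momentum, and variance. It also assumes that only the fp16 weights stay on the GPU when offloading. The function name model_state_gb is hypothetical, and activations and transient buffers are ignored.

```python
# Rough model-state memory estimate for mixed-precision Adam training.
# All byte counts and the GPU/CPU split are illustrative assumptions,
# not numbers taken from the abstract.
FP16_PARAM_BYTES = 2       # fp16 copy of the weights
FP16_GRAD_BYTES = 2        # fp16 gradients
FP32_OPTIMIZER_BYTES = 12  # fp32 master weights + momentum + variance (4 bytes each)


def model_state_gb(num_params, offload):
    """Return (gpu_gb, cpu_gb) for model states only.

    Activations and temporary buffers are deliberately ignored.
    With offload=True, gradients and optimizer states are assumed
    to live in CPU memory; fp16 weights stay on the GPU.
    """
    gpu_bytes_per_param = FP16_PARAM_BYTES
    cpu_bytes_per_param = 0
    if offload:
        cpu_bytes_per_param = FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES
    else:
        gpu_bytes_per_param += FP16_GRAD_BYTES + FP32_OPTIMIZER_BYTES
    return (num_params * gpu_bytes_per_param / 1e9,
            num_params * cpu_bytes_per_param / 1e9)


if __name__ == "__main__":
    for offload in (False, True):
        gpu_gb, cpu_gb = model_state_gb(13_000_000_000, offload)
        print(f"offload={offload}: GPU {gpu_gb:.1f} GB, CPU {cpu_gb:.1f} GB")
```

Under these assumptions, a 13-billion-parameter model needs about 208 GB of GPU memory for model states without offloading, but only about 26 GB on the GPU (plus about 182 GB of CPU memory) with offloading. That would be consistent with the single-GPU claim if the V100 is a 32 GB card, which is also an assumption here.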
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Parallel Computing and Optimization Techniques
Methods: ZeRO-Offload
