ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari; Jeff Rasley; Olatunji Ruwase; Yuxiong He

arXiv:1910.02054 · cs.LG · May 14, 2020 · 72 cites

TL;DR

ZeRO introduces memory optimization techniques that enable efficient training of trillion-parameter models by eliminating redundancies, significantly increasing model size and training speed on existing hardware.

Contribution

The paper presents ZeRO, a novel optimizer that reduces memory redundancy in large-scale model training, allowing models to scale beyond 1 trillion parameters with high efficiency.

Findings

01. Trains models of over 100B parameters, with super-linear speedup on 400 GPUs.

02. Reaches a throughput of 15 petaflops, an 8x increase in model size and a 10x increase in achievable performance over the state of the art.

03. Trains models of up to 13B parameters without model parallelism.

Abstract

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion…
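The claim that model size scales with the number of devices can be made concrete with a rough memory estimate. The sketch below follows the per-device model-state accounting given in the full paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for the fp32 optimizer states. The function name `model_state_bytes` and the integer `stage` encoding are illustrative choices, not part of any library, and the estimate ignores activations, temporary buffers, and fragmentation.

```python
def model_state_bytes(num_params, num_devices, stage, k=12):
    """Estimate per-device bytes of model states under mixed-precision Adam.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned across devices
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    k is the optimizer-state multiplier (assumed 12: fp32 weights,
    momentum, and variance at 4 bytes each).
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")

    params = 2 * num_params   # fp16 parameters
    grads = 2 * num_params    # fp16 gradients
    optim = k * num_params    # fp32 optimizer states

    # Each stage partitions one more component across the devices.
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative setting: a 7.5B-parameter model on 64 data-parallel devices.
    for stage in range(4):
        gb = model_state_bytes(7.5e9, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, the per-device footprint for a 7.5B-parameter model on 64 devices drops from 120 GB with plain data parallelism to roughly 31.4 GB, 16.6 GB, and 1.9 GB as each further component is partitioned. With all three components partitioned the footprint is inversely proportional to the device count.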

Code & Models

Videos

Turing-NLG, DeepSpeed and the ZeRO optimizer · YouTube

Taxonomy

Topics: Advanced Neural Network Applications · Topic Modeling · Ferroelectric and Negative Capacitance Devices

Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ZeRO · Cosine Annealing · Softmax · Layer Normalization · Attention Is All You Need · Discriminative Fine-Tuning · Residual Connection · Multi-Head Attention