ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

TL;DR
ZeRO introduces memory optimizations that eliminate redundancies in data- and model-parallel training, increasing both the model size and the training speed achievable on existing hardware, with the potential to scale to trillion-parameter models.
Contribution
The paper presents ZeRO (Zero Redundancy Optimizer), a memory optimization approach that removes redundant memory use in large-scale model training, with the potential to scale models beyond 1 trillion parameters at sustained high efficiency.
Findings
Trains models over 100B parameters with super-linear speedup on 400 GPUs.
Achieves 15 petaflops of throughput, an 8x increase in model size and a 10x increase in achievable performance over the state of the art.
Supports training models up to 13B parameters without model parallelism.
Abstract
Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion…
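The claim that model size can scale in proportion to the number of devices can be illustrated with a rough memory estimate. The sketch below is not taken from this page: it assumes mixed-precision Adam training with 2 bytes per parameter for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes of optimizer state, and it assumes three cumulative partitioning stages (optimizer state, then gradients, then parameters). The function name `per_device_memory_gb` and the `stage` numbering are hypothetical, chosen only for illustration.

```python
def per_device_memory_gb(num_params, num_devices, stage, optim_bytes_per_param=12):
    """Rough per-device model-state memory in GB (1e9 bytes).

    Assumptions (not stated on this page): fp16 weights and gradients at
    2 bytes each, plus `optim_bytes_per_param` bytes of optimizer state.
    stage 0: everything replicated (plain data parallelism)
    stage 1: optimizer state partitioned across devices
    stage 2: gradients also partitioned
    stage 3: parameters also partitioned
    """
    params_bytes = 2.0 * num_params
    grads_bytes = 2.0 * num_params
    optim_bytes = float(optim_bytes_per_param) * num_params

    if stage >= 1:
        optim_bytes /= num_devices
    if stage >= 2:
        grads_bytes /= num_devices
    if stage >= 3:
        params_bytes /= num_devices

    return (params_bytes + grads_bytes + optim_bytes) / 1e9


if __name__ == "__main__":
    # Example: a 7.5B-parameter model on 64 devices.
    for stage in range(4):
        gb = per_device_memory_gb(7.5e9, 64, stage)
        print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, only the fully partitioned case shrinks as 1/N with the device count, which is what lets the trainable model size grow with the number of devices. Activations, temporary buffers, and fragmentation are ignored here.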
Code & Models
- 🤗 stabilityai/stablecode-completion-alpha-3b · model · 147 dl · ♡ 117
- 🤗 stabilityai/stablelm-base-alpha-7b-v2 · model · 6.6k dl · ♡ 44
- 🤗 stabilityai/stablelm-base-alpha-3b-v2 · model · 324 dl · ♡ 27
- 🤗 stabilityai/stablecode-completion-alpha-3b-4k · model · 1.1k dl · ♡ 277
- 🤗 TheBloke/stablecode-completion-alpha-3b-4k-GGML · model · 24 dl · ♡ 23
- 🤗 TheBloke/stablecode-completion-alpha-3b-4k-GPTQ · model · 21 dl · ♡ 9
- 🤗 justinmeans/stablecode-completion-alpha-3b-4k-coreml · model · ♡ 1
- 🤗 rtlabs/StableCode-3B · model · 12 dl · ♡ 1
- 🤗 DominikLindorfer/SQL-LLaMA · model
- 🤗 stabilityai/stablelm-3b-4e1t · model · 35k dl · ♡ 312
Videos
Turing-NLG, DeepSpeed and the ZeRO optimizer · YouTube
Taxonomy
Topics: Advanced Neural Network Applications · Topic Modeling · Ferroelectric and Negative Capacitance Devices
Methods: Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · ZeRO · Cosine Annealing · Softmax · Layer Normalization · Attention Is All You Need · Discriminative Fine-Tuning · Residual Connection · Multi-Head Attention
