PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li, Adam Paszke, Jeff Smith, Brian Vaughan, Pritam Damania, Soumith Chintala

TL;DR
This paper discusses the design, implementation, and evaluation of PyTorch's distributed data parallel module, demonstrating near-linear scalability on 256 GPUs for large-scale deep learning training.
Contribution
It describes the optimization techniques behind PyTorch's distributed data parallel module, including gradient bucketing and overlapping communication with computation, which improve training efficiency and scalability on large GPU clusters.
Findings
Achieves near-linear scalability on 256 GPUs
Implements gradient bucketing and overlaps communication with computation
Describes techniques for optimizing distributed data parallel training
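To make the gradient bucketing finding concrete, here is a minimal sketch in plain Python. It is not the module's actual implementation: the function name `build_buckets`, the unitless sizes, and the cap value are assumptions made for illustration. It groups parameters into size-capped buckets, visiting them in reverse order on the assumption that the backward pass produces gradients roughly from the last layer to the first.

```python
def build_buckets(param_sizes, bucket_cap):
    """Group parameter indices into buckets whose total size stays within bucket_cap.

    Hypothetical helper for illustration only. Parameters are visited in
    reverse order, assuming gradients become ready from the last layer back.
    A single parameter larger than the cap still gets a bucket of its own.
    """
    buckets, current, current_size = [], [], 0
    for idx in reversed(range(len(param_sizes))):
        size = param_sizes[idx]
        # Close the current bucket if adding this parameter would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


# Illustrative sizes for five parameters (arbitrary units, not real data).
param_sizes = [4, 2, 6, 3, 1]
buckets = build_buckets(param_sizes, bucket_cap=8)
print(buckets)
```

With one collective call per bucket instead of one per parameter, a training loop could start communicating a bucket as soon as all of its gradients are ready, while the rest of the backward pass is still running.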
Abstract
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it…
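The replicate, compute, and synchronize cycle described in the abstract can be illustrated with a small simulation. This is a sketch under stated assumptions rather than PyTorch code: the replicas are plain Python floats in one process, `all_reduce_mean` is a hypothetical stand-in for a real collective, and the model is a one-parameter linear fit on made-up data.

```python
def local_gradient(w, shard):
    # Gradient of mean squared error for the toy model y = w * x on one shard.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    # Stand-in for a collective all-reduce: every replica receives the same average.
    mean = sum(values) / len(values)
    return [mean] * len(values)


def train_data_parallel(shards, w0=0.0, lr=0.01, steps=50):
    # One model replica per shard, all starting from the same weight.
    weights = [w0] * len(shards)
    for _ in range(steps):
        # Each replica computes a gradient independently on its own data.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Gradients are averaged across replicas at every iteration.
        synced = all_reduce_mean(grads)
        # Every replica applies the same update, so the replicas stay consistent.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights


# Made-up data following y = 3x, split into two equal shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[0:4], data[4:8]]

replica_weights = train_data_parallel(shards)
single_weight = train_data_parallel([data])[0]
print(replica_weights, single_weight)
```

Because the shards are the same size, averaging the per-shard gradients gives the same value as the full-batch gradient up to floating-point rounding, so the replicas track what a single process would learn on all of the data.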
Taxonomy
Topics: Software System Performance and Reliability · Data Quality and Management · Machine Learning and Data Classification
Methods: PyTorch DDP
