OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning

Youhe Jiang; Fangcheng Fu; Xupeng Miao; Xiaonan Nie; Bin Cui

arXiv:2209.13258 · cs.DC · May 22, 2023

TL;DR

OSDP is an automated system that optimally combines data and model parallelism for distributed deep learning, improving training efficiency and enabling larger models with higher throughput.

Contribution

It introduces OSDP, a novel automated parallel training system that balances memory and hardware utilization, and employs operator splitting to reduce peak memory usage during training.
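The summary above does not say how operator splitting is carried out, so the following is only a minimal sketch under an assumption: a large matrix multiplication is cut into column slices of the weight matrix that are processed one after another, so only one slice of the weights has to be held at a time. The function names `matmul`, `split_columns`, and `split_matmul` are hypothetical and are not taken from OSDP.

```python
def matmul(x, w):
    """Plain matrix product of x (n x k) and w (k x m), given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_slices):
    """Cut weight matrix w into contiguous column slices (at most num_slices)."""
    m = len(w[0])
    step = -(-m // num_slices)  # ceiling division
    return [[row[i:i + step] for row in w] for i in range(0, m, step)]


def split_matmul(x, w, num_slices):
    """Compute x @ w one column slice at a time and concatenate the partial outputs.

    Assumption for illustration: in a sharded setting, each slice would be
    gathered, used, and released before the next one is touched, which is
    where the peak-memory saving would come from.
    """
    out = [[] for _ in x]
    for w_slice in split_columns(w, num_slices):
        part = matmul(x, w_slice)  # only this slice of w is needed here
        for out_row, part_row in zip(out, part):
            out_row.extend(part_row)
    return out
```

The split version returns the same result as the unsplit product; what changes is how much of the weight matrix is live at once, at the cost of running more, smaller operations.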

Findings

01 Outperforms state-of-the-art methods on multiple large-scale models

02 Enables training of larger models with higher throughput

03 Demonstrates significant efficiency improvements in distributed training

Abstract

Large-scale deep learning models contribute to significant performance improvements on a variety of downstream tasks. Current data and model parallelism approaches use model replication and partitioning techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency because of complex model architectures and strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages of both data and model parallelism. Given the model description and the device information, OSDP makes trade-offs between memory consumption and hardware utilization, and thereby automatically generates the distributed computation graph and maximizes the overall system throughput. In addition, OSDP introduces operator…
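The trade-off between memory consumption and hardware utilization can be pictured with a toy search. This is a sketch under stated assumptions, not the paper's algorithm: each operator is either replicated (more memory, less time) or sharded (less memory, more time), per-operator costs simply add up, and all plans are enumerated exhaustively. The function `best_plan`, the cost tuples, and every number used with it are hypothetical.

```python
from itertools import product


def best_plan(ops, memory_limit):
    """Pick, per operator, whether to shard it so that total time is minimal
    while total memory stays within memory_limit.

    ops: list of (name, mem_replicated, time_replicated, mem_sharded, time_sharded)
    Returns (total_time, {name: is_sharded}) or None if no plan fits.
    Assumption: memory and time are additive across operators, which is a
    simplification of real peak-memory behavior.
    """
    best = None
    for choice in product((False, True), repeat=len(ops)):
        mem = sum(op[3] if shard else op[1] for op, shard in zip(ops, choice))
        time = sum(op[4] if shard else op[2] for op, shard in zip(ops, choice))
        if mem <= memory_limit and (best is None or time < best[0]):
            best = (time, {op[0]: shard for op, shard in zip(ops, choice)})
    return best
```

With two hypothetical operators `("a", 8, 10, 2, 15)` and `("b", 4, 10, 1, 12)`, a memory limit of 12 allows both to stay replicated, while a limit of 9 forces the search to shard only `b`, the cheaper of the two plans that fit. Exhaustive enumeration grows exponentially with the number of operators, so it only serves to illustrate the objective.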

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning