OSDP: Optimal Sharded Data Parallel for Distributed Deep Learning
Youhe Jiang, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Bin, Cui

TL;DR
OSDP is an automated system that optimally combines data and model parallelism for distributed deep learning, improving training efficiency and enabling larger models with higher throughput.
Contribution
It introduces OSDP, a novel automated parallel training system that balances memory and hardware utilization, and employs operator splitting to reduce peak memory usage during training.
Findings
Outperforms state-of-the-art methods in multiple large-scale models
Enables training of larger models with higher throughput
Demonstrates significant efficiency improvements in distributed training
Abstract
Large-scale deep learning models contribute to significant performance improvements on varieties of downstream tasks. Current data and model parallelism approaches utilize model replication and partition techniques to support the distributed training of ultra-large models. However, directly deploying these systems often leads to sub-optimal training efficiency due to the complex model architectures and the strict device memory constraints. In this paper, we propose Optimal Sharded Data Parallel (OSDP), an automated parallel training system that combines the advantages from both data and model parallelism. Given the model description and the device information, OSDP makes trade-offs between the memory consumption and the hardware utilization, thus automatically generates the distributed computation graph and maximizes the overall system throughput. In addition, OSDP introduces operator…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdvanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Domain Adaptation and Few-Shot Learning
