OneFlow: Redesign the Distributed Deep Learning Framework from Scratch
Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, and Jie Zhao

TL;DR
OneFlow is a new distributed deep learning framework that simplifies programming for various parallelism paradigms by using the SBP abstraction and the actor model, and it demonstrates improved efficiency and flexibility over existing frameworks.
Contribution
It introduces a design based on the SBP abstraction and the actor model, yielding a flexible, efficient distributed training framework that surpasses existing solutions in large model training.
Findings
Outperforms well-known customized libraries.
Efficient training of large DNN models.
Simplifies programming of parallelism paradigms.
Abstract
Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than…
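To make the SBP (split, broadcast and partial-value) idea in the abstract concrete, here is a minimal NumPy sketch that simulates two devices on one machine. It is not OneFlow's API: the helper names `split`, `broadcast`, `gather_split`, and `reduce_partial` are hypothetical, and the three placement combinations shown are simply the ones that follow from how matrix multiplication decomposes.

```python
import numpy as np

def split(tensor, axis, parts):
    """S(axis): each simulated device holds one slice of the logical tensor."""
    return np.array_split(tensor, parts, axis=axis)

def broadcast(tensor, parts):
    """B: each simulated device holds a full copy of the logical tensor."""
    return [tensor.copy() for _ in range(parts)]

def gather_split(shards, axis):
    """Recover the logical tensor from S(axis) shards by concatenation."""
    return np.concatenate(shards, axis=axis)

def reduce_partial(shards):
    """Recover the logical tensor from partial-value shards by summation."""
    return np.sum(np.stack(shards), axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))   # logical input: (batch, features)
w = rng.standard_normal((6, 8))   # logical weight: (features, outputs)
expected = x @ w
devices = 2

# Data parallelism: x is S(0), w is B, so the output is S(0).
y_dp = gather_split(
    [xs @ ws for xs, ws in zip(split(x, 0, devices), broadcast(w, devices))],
    axis=0,
)

# Model parallelism: x is B, w is S(1), so the output is S(1).
y_mp = gather_split(
    [xs @ ws for xs, ws in zip(broadcast(x, devices), split(w, 1, devices))],
    axis=1,
)

# Splitting the contracted dimension: x is S(1), w is S(0),
# so each device holds a partial value that must be summed.
y_ps = reduce_partial(
    [xs @ ws for xs, ws in zip(split(x, 1, devices), split(w, 0, devices))]
)

print(np.allclose(y_dp, expected), np.allclose(y_mp, expected), np.allclose(y_ps, expected))
```

In all three cases the per-device computation is the same local matrix multiply; only the placement of the inputs differs, and that placement determines how the logical output is reassembled.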
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices
