OneFlow: Redesign the Distributed Deep Learning Framework from Scratch

Jinhui Yuan, Xinqi Li, Cheng Cheng, Juncheng Liu, Ran Guo, Shenghang Cai, Chi Yao, Fei Yang, Xiaodong Yi, Chuan Wu, Haoran Zhang, Jie Zhao

arXiv:2110.15032 · cs.DC · April 20, 2022 · 34 cites

TL;DR

OneFlow is a new distributed deep learning framework that simplifies programming for a range of parallelism paradigms by combining the SBP (split, broadcast, and partial-value) abstraction with the actor model, and it demonstrates improved efficiency and flexibility over existing frameworks.

Contribution

The paper introduces a framework design based on the SBP abstraction and the actor model, yielding a flexible, efficient distributed training framework that surpasses existing solutions in large-model training.

Findings

01. Outperforms well-known customized libraries.

02. Trains large DNN models efficiently.

03. Simplifies programming of parallelism paradigms.

Abstract

Deep learning frameworks such as TensorFlow and PyTorch provide a productive interface for expressing and training a deep neural network (DNN) model on a single device or using data parallelism. Still, they may not be flexible or efficient enough in training emerging large models on distributed devices, which require more sophisticated parallelism beyond data parallelism. Plugins or wrappers have been developed to strengthen these frameworks for model or pipeline parallelism, but they complicate the usage and implementation of distributed deep learning. Aiming at a simple, neat redesign of distributed deep learning frameworks for various parallelism paradigms, we present OneFlow, a novel distributed training framework based on an SBP (split, broadcast and partial-value) abstraction and the actor model. SBP enables much easier programming of data parallelism and model parallelism than…
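
The three SBP signatures named in the abstract (split, broadcast, and partial-value) can be illustrated on a single matrix multiplication. The following is a minimal sketch, not OneFlow code: plain Python lists stand in for tensors placed on two hypothetical devices, and the helper names `matmul` and `add` are assumptions introduced only for this example. It shows how each placement of the inputs determines how the output is distributed.

```python
# Illustrative sketch only: nested lists stand in for tensors on two
# hypothetical devices. None of these helpers are OneFlow APIs.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

X = [[1, 2, 3, 4], [5, 6, 7, 8]]        # 2 x 4 input
W = [[1, 0], [0, 1], [1, 1], [2, -1]]   # 4 x 2 weight
Y = matmul(X, W)                        # single-device reference result

# Data parallelism: X is split by rows, W is broadcast (fully copied).
# Each device produces a row slice of Y, so Y is split along axis 0.
y_data = matmul(X[:1], W) + matmul(X[1:], W)   # concatenate row slices

# Model parallelism: X is broadcast, W is split by columns.
# Each device produces a column slice of Y, so Y is split along axis 1.
W_left = [row[:1] for row in W]
W_right = [row[1:] for row in W]
y_model = [r0 + r1 for r0, r1 in zip(matmul(X, W_left), matmul(X, W_right))]

# Partial value: X is split by columns and W by rows.
# Each device holds a full-shaped partial sum; adding them recovers Y.
X_left = [row[:2] for row in X]
X_right = [row[2:] for row in X]
y_partial = add(matmul(X_left, W[:2]), matmul(X_right, W[2:]))

print(Y, y_data == Y, y_model == Y, y_partial == Y)
```

In all three cases the logical result is the same tensor; only its physical distribution across devices differs, which is the property the SBP abstraction is meant to capture.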

Code & Models

Repositories

Oneflow-Inc/oneflow · pytorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Ferroelectric and Negative Capacitance Devices