Dflow, a Python framework for constructing cloud-native AI-for-Science workflows
Xinzijian Liu, Yanbo Han, Zhuoyuan Li, Jiahao Fan, Chengqian Zhang,, Jinzhe Zeng, Yifan Shan, Yannan Yuan, Wei-Hong Xu, Yun-Pei Liu, Yuzhi Zhang,, Tongqi Wen, Darrin M. York, Zhicheng Zhong, Hang Zheng, Jun Cheng, Linfeng, Zhang, Han Wang

TL;DR
Dflow is an open-source Python framework that simplifies constructing scalable, cloud-native workflows for AI-driven scientific computing, supporting complex process control and task scheduling across diverse infrastructures.
Contribution
Introduces Dflow, a flexible, observable, and scalable Python toolkit for building reproducible workflows on cloud and supercomputers, promoting reusability and collaboration.
Findings
Supports thousands of concurrent nodes per workflow
Enables complex process control and task scheduling
Has been adopted in dozens of scientific projects
Abstract
In the AI-for-science era, scientific computing scenarios such as concurrent learning and high-throughput computing demand a new generation of infrastructure that supports scalable computing resources and automated workflow management on both cloud and high-performance supercomputers. Here we introduce Dflow, an open-source Python toolkit designed for scientists to construct workflows with simple programming interfaces. It enables complex process control and task scheduling across a distributed, heterogeneous infrastructure, leveraging containers and Kubernetes for flexibility. Dflow is highly observable and can scale to thousands of concurrent nodes per workflow, enhancing the efficiency of complex scientific computing tasks. The basic unit in Dflow, known as an Operation (OP), is reusable and independent of the underlying infrastructure or context. Dozens of workflow projects have…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsScientific Computing and Data Management · Distributed and Parallel Computing Systems · Computational Physics and Python Applications
