The TensorFlow Partitioning and Scheduling Problem: It's the Critical Path!
Ruben Mayer, Christian Mayer, Larissa Laich

TL;DR
This paper addresses the complex problem of partitioning and scheduling in TensorFlow's data flow graphs on heterogeneous devices, proposing heuristics that significantly improve execution time by focusing on the critical path.
Contribution
It introduces novel heuristic strategies for combined partitioning and scheduling in TensorFlow, emphasizing critical path minimization to enhance performance.
Findings
Heuristics that focus on the critical path outperform strategies that are agnostic to it.
Up to 4x speed-up achieved with the proposed heuristics.
Simulation results demonstrate effectiveness in communication-intensive workloads.
Abstract
State-of-the-art data flow systems such as TensorFlow impose iterative calculations on large graphs that need to be partitioned on heterogeneous devices such as CPUs, GPUs, and TPUs. However, partitioning cannot be viewed in isolation. Each device has to select the next graph vertex to be executed, i.e., perform local scheduling decisions. Both problems, partitioning and scheduling, are NP-complete by themselves but have to be solved in combination in order to minimize overall execution time of an iteration. In this paper, we propose several heuristic strategies to solve the partitioning and scheduling problem in TensorFlow. We simulate the performance of the proposed strategies in heterogeneous environments with communication-intensive workloads that are common to TensorFlow. Our findings indicate that the best partitioning and scheduling heuristics are those that focus on minimizing…
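To make the notion of a critical path concrete, the following is a minimal Python sketch that computes the longest cost-weighted path through a small data flow graph. It is not the paper's heuristics. The function name `critical_path`, the cost model (one execution time per vertex, one transfer time per edge), and the example numbers are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def critical_path(costs, edges):
    """Return (length, path) of the longest cost-weighted path in a DAG.

    costs: dict mapping vertex -> execution time (hypothetical units)
    edges: iterable of (u, v, transfer_time) tuples, meaning v depends on u
    """
    succ = defaultdict(list)
    indeg = {v: 0 for v in costs}
    for u, v, w in edges:
        succ[u].append((v, w))
        indeg[v] += 1

    # Process vertices in topological order (Kahn's algorithm).
    queue = deque(v for v in costs if indeg[v] == 0)
    finish = dict(costs)              # earliest finish time of each vertex
    pred = {v: None for v in costs}   # predecessor on the longest path
    visited = 0
    while queue:
        u = queue.popleft()
        visited += 1
        for v, w in succ[u]:
            candidate = finish[u] + w + costs[v]
            if candidate > finish[v]:
                finish[v] = candidate
                pred[v] = u
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    if visited != len(costs):
        raise ValueError("graph contains a cycle")

    # Walk back from the vertex that finishes last.
    end = max(finish, key=finish.get)
    length = finish[end]
    path = []
    while end is not None:
        path.append(end)
        end = pred[end]
    return length, path[::-1]


# Hypothetical four-vertex graph: a feeds b and c, which both feed d.
costs = {"a": 2, "b": 3, "c": 1, "d": 4}
edges = [("a", "b", 1), ("a", "c", 0), ("b", "d", 2), ("c", "d", 0)]
length, path = critical_path(costs, edges)
print(length, path)
```

In this toy graph the chain through `b` dominates, so no assignment of `c` to a faster device can shorten the iteration. That is the intuition behind prioritizing critical-path vertices when partitioning and scheduling, although in a real heterogeneous setting the vertex and edge costs depend on the device placement itself.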
