Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration
Ziyue Luo, Yixin Bao, Chuan Wu

TL;DR
This paper presents a novel framework for task placement and online scheduling to accelerate distributed GNN training, addressing data transmission bottlenecks and resource utilization issues.
Contribution
It introduces an integrated algorithm framework with online scheduling and task placement schemes specifically designed for distributed GNN training.
Findings
Achieved up to 67% training speed-up over baselines.
Improved resource utilization and execution pipelining.
Effective handling of large graph data transmission challenges.
Abstract
Training Graph Neural Networks (GNNs) on large graphs is resource-intensive and time-consuming, mainly because the graph data cannot fit into the memory of a single machine and has to be fetched from distributed graph storage and processed on the go. Unlike in distributed deep neural network (DNN) training, the bottleneck in distributed GNN training lies largely in transmitting large graph data to construct mini-batches of training samples. Existing solutions often advocate data-computation colocation and do not work well with limited resources, where colocation is infeasible. The potential of strategic task placement and optimal scheduling of data transmission and task execution has not been well explored. This paper designs an efficient algorithm framework for task placement and execution scheduling of distributed GNN training, to better resource utilization,…
