Characterizing and Understanding Distributed GNN Training on GPUs

Haiyang Lin; Mingyu Yan; Xiaocheng Yang; Mo Zou; Wenming Li; Xiaochun Ye; Dongrui Fan

arXiv:2204.08150·cs.DC·April 19, 2022

Open Access

TL;DR

This paper provides an in-depth analysis of distributed GNN training on GPUs, revealing key insights and guidelines to optimize performance for large-scale graph neural network training.

Contribution

It offers the first comprehensive analysis of distributed GNN training on GPUs, highlighting performance bottlenecks and optimization strategies.

Findings

01. Identifies key performance bottlenecks in distributed GNN training on GPUs.

02. Provides practical guidelines for software and hardware optimization.

03. Enhances understanding of distributed GNN training execution on GPU clusters.

Abstract

Graph neural networks (GNNs) have been demonstrated to be powerful models in many domains for their effectiveness in learning over graphs. To scale GNN training to large graphs, a widely adopted approach is distributed training, which accelerates training using multiple computing nodes. Maximizing performance is essential, but the execution of distributed GNN training remains only preliminarily understood. In this work, we provide an in-depth analysis of distributed GNN training on GPUs, revealing several significant observations and providing useful guidelines for both software and hardware optimization.

Taxonomy

Topics: Advanced Graph Neural Networks · Ferroelectric and Negative Capacitance Devices · Machine Learning in Materials Science