Characterizing and Understanding HGNN Training on GPUs

Dengke Han; Mingyu Yan; Xiaochun Ye; Dongrui Fan

arXiv:2407.11790·cs.LG·October 30, 2024

TL;DR

This paper analyzes the training process of Heterogeneous Graph Neural Networks on GPUs, identifying performance bottlenecks and providing optimization strategies for both single-GPU and multi-GPU scenarios.

Contribution

It offers a detailed characterization of HGNN training on GPUs, revealing bottlenecks and proposing optimization guidelines for improved efficiency.

Findings

01. Identified key performance bottlenecks in HGNN GPU training

02. Analyzed differences between single-GPU and multi-GPU training scenarios

03. Provided optimization strategies for software and hardware improvements

Abstract

Owing to their remarkable representation capabilities for heterogeneous graph data, Heterogeneous Graph Neural Networks (HGNNs) have been widely adopted in many critical real-world domains such as recommendation systems and medical analysis. Prior to their practical application, identifying the optimal HGNN model parameters tailored to specific tasks through extensive training is a time-consuming and costly process. To enhance the efficiency of HGNN training, it is essential to characterize and analyze the execution semantics and patterns within the training process to identify performance bottlenecks. In this study, we conduct an in-depth quantification and analysis of two mainstream HGNN training scenarios, including single-GPU and multi-GPU distributed training. Based on the characterization results, we disclose the performance bottlenecks and their underlying causes in different…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Robotics and Automated Systems · Distributed and Parallel Computing Systems