Accelerating Mini-batch HGNN Training by Reducing CUDA Kernels

Meng Wu, Jingkai Qiu, Mingyu Yan, Wenming Li, Yang Zhang, Zhimin Zhang, Xiaochun Ye, and Dongrui Fan

arXiv:2408.08490·cs.AR·August 19, 2024

TL;DR

HiFuse significantly accelerates mini-batch HGNN training on CPU-GPU systems by reorganizing data, reducing CUDA kernel launches, and overlapping CPU-GPU tasks, achieving over twice the speed of previous methods.

Contribution

This paper introduces HiFuse, a system that improves GPU utilization and training speed for HGNNs by restructuring vertex feature data and offloading graph construction to the CPU, where it is parallelized and overlapped with GPU work.

Findings

01. Achieves 2.38x average speedup over state-of-the-art solutions.

02. Reduces CUDA kernel launches by merging vertex feature matrices.

03. Enhances GPU utilization through CPU-GPU overlapping and parallelization.
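
The CPU-GPU overlapping named in the last finding can be pictured as a two-stage pipeline in which the CPU prepares the next mini-batch while the current one is being trained. The sketch below is only an illustration under stated assumptions: `build_semantic_graphs`, `train_step`, and `run_pipeline` are hypothetical names, the "GPU" stage is a plain Python function, and nothing here is taken from the HiFuse implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def build_semantic_graphs(batch_id):
    # Hypothetical CPU-side stage: stands in for constructing the
    # semantic graphs of one mini-batch.
    return {"batch": batch_id, "edges": [(batch_id, batch_id + 1)]}


def train_step(graphs):
    # Hypothetical GPU-side stage: stands in for the forward and
    # backward passes on one prepared mini-batch.
    return len(graphs["edges"])


def run_pipeline(num_batches):
    """Prepare batch i+1 on a CPU worker while batch i is being trained."""
    processed = []
    if num_batches <= 0:
        return processed
    with ThreadPoolExecutor(max_workers=1) as cpu_worker:
        pending = cpu_worker.submit(build_semantic_graphs, 0)
        for i in range(num_batches):
            graphs = pending.result()  # wait for the prepared batch
            if i + 1 < num_batches:
                # Start preparing the next batch before training this one.
                pending = cpu_worker.submit(build_semantic_graphs, i + 1)
            train_step(graphs)
            processed.append(graphs["batch"])
    return processed


print(run_pipeline(4))
```

A thread is enough to show the ordering of the two stages; for CPU-bound preparation in real Python code, worker processes would likely be the better choice because of the interpreter lock.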

Abstract

Heterogeneous graph neural networks (HGNNs) are essential for capturing the structure and semantic information in heterogeneous graphs. However, existing GPU-based solutions, such as PyTorch Geometric, suffer from low GPU utilization due to numerous short-execution-time and memory-bound CUDA kernels during HGNN training. To address this issue, we introduce HiFuse, an enhancement for PyTorch Geometric designed to accelerate mini-batch HGNN training on CPU-GPU systems. From the data perspective, we reorganize and merge multiple smaller vertex feature matrices into larger ones, enabling a single kernel to process larger data chunks. This efficiently exploits data locality, reduces the kernel launch overhead, and improves overall GPU utilization. From the workflow perspective, we sophisticatedly offload the construction of semantic graphs from GPU to CPU to reduce the number of CUDA…
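
The data-side idea in the abstract, merging several small vertex feature matrices so that one kernel handles a larger chunk, can be sketched in plain Python. This is a minimal illustration, not the paper's method: `merge_feature_matrices` and `linear` are hypothetical helpers, each call to `linear` stands in for one kernel launch, and the sketch assumes every vertex type has the same feature width and shares one projection matrix, which the summary above does not confirm.

```python
def merge_feature_matrices(features_by_type):
    """Stack per-type feature matrices row-wise into one matrix.

    Returns the merged matrix and, for each vertex type, the (start, end)
    row range it occupies, so per-type results can be sliced back out.
    """
    merged, offsets, start = [], {}, 0
    for vtype, rows in features_by_type.items():
        merged.extend(rows)
        offsets[vtype] = (start, start + len(rows))
        start += len(rows)
    return merged, offsets


def linear(matrix, weight):
    # Stand-in for one kernel launch: a dense matrix product X @ W.
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
            for row in matrix]


# Hypothetical toy data: two vertex types with 2-dimensional features.
features = {
    "author": [[1.0, 0.0], [0.0, 1.0]],
    "paper": [[2.0, 2.0]],
}
weight = [[1.0, 2.0], [3.0, 4.0]]  # shared 2x2 projection (an assumption)

merged, offsets = merge_feature_matrices(features)
out = linear(merged, weight)  # one call instead of one call per vertex type
start, end = offsets["paper"]
paper_out = out[start:end]  # rows belonging to the "paper" vertices
```

The offsets table is what makes the merge reversible: downstream code can still address each vertex type by slicing the merged result.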

Taxonomy

Topics: Speech Recognition and Synthesis · Brain Tumor Detection and Classification