CaPGNN: Optimizing Parallel Graph Neural Network Training with Joint Caching and Resource-Aware Graph Partitioning
Xianfeng Song, Yi Zou, Zheng Shi

TL;DR
CaPGNN is a framework that improves the efficiency of parallel full-batch GNN training on single-server multi-GPU systems by combining joint caching, resource-aware graph partitioning, and overlap of computation with communication.
Contribution
The paper introduces CaPGNN, a framework that combines adaptive caching with heuristic graph partitioning to speed up parallel GNN training and reduce communication costs.
Findings
Training efficiency improved by up to 18.98x
Communication costs reduced by up to 99%
Maintains or improves accuracy in various scenarios
Abstract
Graph-structured data is ubiquitous in the real world, and Graph Neural Networks (GNNs) have become increasingly popular in various fields due to their ability to process such irregular data directly. However, as data scales, GNNs become inefficient. Although parallel training offers performance improvements, increased communication costs often offset these advantages. To address this, this paper introduces CaPGNN, a novel parallel full-batch GNN training framework for a single server with multiple GPUs. First, because the number of remote vertices in a partition is often greater than or equal to the number of local vertices, and many vertices may be duplicated, we propose a joint adaptive caching algorithm that leverages both CPU and GPU memory, integrating lightweight cache update and prefetch techniques to effectively reduce redundant communication costs.…
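The observation behind the caching step (remote vertices can match or outnumber local ones, and the same vertex may be needed by several partitions) can be illustrated with a minimal sketch. This is not CaPGNN's algorithm: the toy graph, the partition assignment, and the helper name `halo_stats` are all hypothetical and chosen only to make the counts easy to follow.

```python
from collections import Counter

def halo_stats(edges, part_of):
    """Return local vertices, remote (halo) vertices, and shared remote vertices.

    edges:   undirected edge list of (u, v) pairs
    part_of: dict mapping each vertex to its partition id (hypothetical assignment)
    """
    parts = sorted(set(part_of.values()))
    local = {p: set() for p in parts}
    remote = {p: set() for p in parts}
    for v, p in part_of.items():
        local[p].add(v)
    for u, v in edges:
        pu, pv = part_of[u], part_of[v]
        if pu != pv:
            # Each endpoint is a remote vertex for the other endpoint's partition.
            remote[pu].add(v)
            remote[pv].add(u)
    # A vertex requested by more than one partition is a candidate for shared caching.
    demand = Counter(v for p in parts for v in remote[p])
    duplicated = {v for v, n in demand.items() if n > 1}
    return local, remote, duplicated

# Toy graph: six vertices split evenly across three partitions (assumed layout).
edges = [(0, 1), (1, 2), (1, 4), (2, 3), (3, 4), (4, 5)]
part_of = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2, 5: 2}
local, remote, duplicated = halo_stats(edges, part_of)

for p in sorted(local):
    print(f"partition {p}: {len(local[p])} local, {len(remote[p])} remote")
print("needed by more than one partition:", sorted(duplicated))
```

In this toy case every partition has as many remote vertices as local ones, and two vertices are requested by more than one partition. Features of such repeatedly requested vertices are the kind of data a cache spanning CPU and GPU memory could hold once instead of transferring them again for each partition.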
