Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training
Aditya K. Ranjan, Siddharth Singh, Cunyang Wei, Abhinav Bhatele

TL;DR
Plexus is a 3D parallel method for full-graph GNN training that scales to billion-edge graphs on up to 2048 GPUs and achieves 2.3-12.5x speedups over prior methods.
Contribution
The paper presents a 3D parallel approach to full-graph GNN training, together with a double permutation scheme for load balancing and a performance model that predicts the optimal 3D configuration, enabling training on billion-edge graphs.
Findings
Achieves 2.3-12.5x speedup over prior methods.
Reduces training time by up to 54.2x on large GPU clusters.
Scales to billion-edge graphs on up to 2048 GPUs.
Abstract
Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation -- Plexus. We evaluate Plexus on six different graph datasets and show scaling…
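The abstract names a double permutation scheme for load balancing but does not spell out how it works. The sketch below illustrates only the general idea that relabeling the rows and columns of a sparse adjacency matrix can even out the number of nonzeros per block in a 2D partition. It assumes two independent random permutations, one for rows and one for columns; the synthetic graph, the 4 x 4 block grid, and the helper names `block_nnz`, `imbalance`, and `permute_edges` are all hypothetical and are not taken from Plexus, whose actual scheme may differ.

```python
import numpy as np


def block_nnz(rows, cols, n, grid):
    """Count edge entries in each block of a grid x grid partition of an n x n matrix."""
    block = -(-n // grid)  # ceil(n / grid): side length of one block
    counts = np.zeros((grid, grid), dtype=np.int64)
    # np.add.at accumulates correctly even when a block index repeats
    np.add.at(counts, (rows // block, cols // block), 1)
    return counts


def imbalance(counts):
    """Heaviest block load divided by mean block load (1.0 is perfectly balanced)."""
    return counts.max() / counts.mean()


def permute_edges(rows, cols, n, rng):
    """Relabel rows and columns with two independent random permutations (assumed scheme)."""
    row_perm = rng.permutation(n)
    col_perm = rng.permutation(n)
    return row_perm[rows], col_perm[cols]


rng = np.random.default_rng(0)
n, num_edges, grid = 4096, 200_000, 4

# Synthetic skewed graph (illustrative only): endpoints cluster at low node ids,
# so the top-left block of the unpermuted matrix is far denser than the rest.
rows = (n * rng.random(num_edges) ** 3).astype(np.int64)
cols = (n * rng.random(num_edges) ** 3).astype(np.int64)

before = block_nnz(rows, cols, n, grid)
p_rows, p_cols = permute_edges(rows, cols, n, rng)
after = block_nnz(p_rows, p_cols, n, grid)

print("imbalance before permutation:", imbalance(before))
print("imbalance after permutation: ", imbalance(after))
```

The permutation only relabels nodes, so the total number of edge entries is unchanged; what changes is how evenly they fall across blocks, which is a rough proxy for per-GPU work in a blocked sparse matrix multiply.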
