DistGNN-MB: Distributed Large-Scale Graph Neural Network Training on x86 via Minibatch Sampling

Md Vasimuddin; Ramanarayan Mohanty; Sanchit Misra; Sasikanth Avancha

arXiv:2211.06385·cs.LG·November 14, 2022·1 cites

Open Access

TL;DR

DistGNN-MB is a distributed minibatch training method for large-scale graph neural networks on x86 clusters. It reduces training time by combining a Historical Embedding Cache with compute-communication overlap; on 32 nodes it trains GraphSAGE on OGBN-Papers100M in 2 seconds per epoch.

Contribution

The paper presents a distributed GNN training approach that pairs a Historical Embedding Cache with compute-communication overlap, offsetting the lower compute and higher communication volume that strong scaling causes on graphs with billions of vertices and edges.

Findings

01

Trains a 3-layer GraphSAGE model on OGBN-Papers100M in 2 seconds per epoch on 32 nodes.

02

Trains GraphSAGE 5.2x faster than DistDGL at 32 nodes.

03

Speeds up GAT training by 17.2x as compute nodes scale from 2 to 32.
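To put these figures in context, the following is a back-of-the-envelope sketch in Python that turns the quoted speedups into strong-scaling efficiency (observed speedup divided by the ideal 16x for a 2-to-32 node increase). It assumes the quoted speedups are ratios of epoch times; the "implied" epoch times are derived here from rounded figures on this page and are not reported results. The function and variable names are illustrative only.

```python
# Figures quoted on this page (findings and abstract).
NODES_SMALL, NODES_LARGE = 2, 32
SPEEDUP_2_TO_32 = {"GraphSAGE": 10.0, "GAT": 17.2}
EPOCH_SECONDS_AT_32 = {"GraphSAGE": 2.0, "GAT": 4.9}
SPEEDUP_OVER_DISTDGL = 5.2  # GraphSAGE only, at 32 nodes


def scaling_efficiency(speedup, nodes_small, nodes_large):
    """Observed speedup divided by the ideal (linear) speedup."""
    return speedup / (nodes_large / nodes_small)


for model, speedup in SPEEDUP_2_TO_32.items():
    eff = scaling_efficiency(speedup, NODES_SMALL, NODES_LARGE)
    # Assumption: speedup is a ratio of epoch times, so the 2-node epoch
    # time is roughly the 32-node epoch time multiplied by the speedup.
    implied_2_node = EPOCH_SECONDS_AT_32[model] * speedup
    print(f"{model}: efficiency {eff:.1%}, implied 2-node epoch ~{implied_2_node:.1f} s")

# Assumption: "5.2x faster" is also an epoch-time ratio at 32 nodes.
implied_distdgl = EPOCH_SECONDS_AT_32["GraphSAGE"] * SPEEDUP_OVER_DISTDGL
print(f"Implied DistDGL GraphSAGE epoch at 32 nodes: ~{implied_distdgl:.1f} s")
```

Under these assumptions GraphSAGE scales at 62.5% efficiency, while GAT comes out at 107.5%, slightly better than linear. This page does not explain the super-linear GAT figure.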

Abstract

Training Graph Neural Networks at scale on graphs containing billions of vertices and edges using minibatch sampling poses a key challenge: strong-scaling graphs and training examples results in lower compute and higher communication volume, with potential performance loss. DistGNN-MB employs a novel Historical Embedding Cache combined with compute-communication overlap to address this challenge. On a 32-node (64-socket) cluster of 3rd generation Intel Xeon Scalable Processors with 36 cores per socket, DistGNN-MB trains 3-layer GraphSAGE and GAT models on OGBN-Papers100M to convergence with epoch times of 2 seconds and 4.9 seconds, respectively. At this scale, DistGNN-MB trains GraphSAGE 5.2x faster than the widely used DistDGL. DistGNN-MB trains GraphSAGE and GAT 10x and 17.2x faster, respectively, as compute nodes scale from 2 to 32.
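The abstract names a Historical Embedding Cache but this page does not describe its design. The sketch below is therefore not the paper's implementation; it only illustrates the general idea behind "historical embeddings" in the GNN literature, where an embedding computed in an earlier iteration is reused instead of being fetched again. The class name, the `capacity` and `max_age` parameters, the oldest-first eviction policy, and the `fetch_remote` callback (a stand-in for inter-node communication) are all assumptions made for illustration. Compute-communication overlap is not modeled.

```python
from collections import OrderedDict


class HistoricalEmbeddingCache:
    """Hypothetical cache of remote-vertex embeddings (not the paper's code)."""

    def __init__(self, capacity, max_age):
        self.capacity = capacity     # assumed knob: max number of cached vertices
        self.max_age = max_age       # assumed knob: max staleness, in iterations
        self._store = OrderedDict()  # vertex_id -> (embedding, iteration_stored)

    def put(self, vertex_id, embedding, iteration):
        # Re-inserting moves the vertex to the newest end of the ordering.
        self._store.pop(vertex_id, None)
        self._store[vertex_id] = (embedding, iteration)
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the oldest entry

    def get(self, vertex_id, iteration):
        entry = self._store.get(vertex_id)
        if entry is None:
            return None
        embedding, stored_at = entry
        if iteration - stored_at > self.max_age:
            del self._store[vertex_id]  # too stale: force a fresh fetch
            return None
        return embedding


def gather_embeddings(vertex_ids, iteration, cache, fetch_remote):
    """Return embeddings for vertex_ids, communicating only for cache misses."""
    result, misses = {}, []
    for v in vertex_ids:
        hit = cache.get(v, iteration)
        if hit is None:
            misses.append(v)
        else:
            result[v] = hit
    if misses:
        for v, emb in fetch_remote(misses).items():
            cache.put(v, emb, iteration)
            result[v] = emb
    return result


# Usage: record which vertex ids actually trigger "communication".
calls = []


def fake_fetch(ids):
    calls.append(list(ids))
    return {v: [float(v)] * 2 for v in ids}


cache = HistoricalEmbeddingCache(capacity=4, max_age=2)
gather_embeddings([1, 2], 0, cache, fake_fetch)     # both miss
gather_embeddings([1, 2, 3], 1, cache, fake_fetch)  # only vertex 3 misses
gather_embeddings([1], 3, cache, fake_fetch)        # vertex 1 is too stale
print(calls)
```

In this toy run only the misses and the stale entry reach `fake_fetch`, which is the sense in which such a cache can trade embedding freshness for lower communication volume.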

Taxonomy

Topics: Advanced Graph Neural Networks · Advanced Memory and Neural Computing · Ferroelectric and Negative Capacitance Devices

Methods: DistDGL · GraphSAGE · Graph Attention Network