BGL: GPU-Efficient GNN Training by Optimizing Graph Data I/O and Preprocessing

Tianfeng Liu (1, 3), Yangrui Chen (2, 3), Dan Li (1), Chuan Wu (2), Yibo Zhu (3), Jun He (3), Yanghua Peng (3), Hongzheng Chen (3, 4), Hongzhi Chen (3), Chuanxiong Guo (3) ((1) Tsinghua University, (2) The University of Hong Kong, (3) ByteDance, (4) Cornell University)

arXiv:2112.08541·cs.LG·December 17, 2021·21 cites

Open Access

TL;DR

This paper introduces BGL, a GPU-efficient distributed GNN training system that optimizes data I/O and preprocessing, significantly improving training speed on large graphs by reducing bottlenecks in data preparation.

Contribution

BGL presents novel caching, graph partitioning, and resource management techniques to enhance GPU-based GNN training efficiency on large-scale graphs.

Findings

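01. BGL achieves a 20.68x average speedup over existing GNN training systems.

02. The dynamic cache engine, co-designed with the sampling order, minimizes feature-retrieval traffic while keeping overhead low and the cache hit ratio high.

03. The improved graph partition algorithm reduces cross-partition communication during subgraph sampling.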
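
To make the partitioning finding concrete, the sketch below counts the edges whose endpoints land in different partitions, which is a rough proxy for the communication incurred when sampling crosses a partition boundary. It is not BGL's partition algorithm: the toy graph, the two node-to-partition assignments, and the helper name `cross_partition_edges` are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple

Edge = Tuple[int, int]


def cross_partition_edges(edges: Iterable[Edge], part_of: Dict[int, int]) -> int:
    """Count edges whose two endpoints are assigned to different partitions."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])


# Toy graph (assumption): two triangles joined by a single bridge edge (2, 3).
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# Locality-preserving assignment: each triangle stays in one partition.
clustered = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

# Structure-oblivious assignment: round-robin by node ID.
round_robin = {node: node % 2 for node in range(6)}

print(cross_partition_edges(EDGES, clustered))    # 1
print(cross_partition_edges(EDGES, round_robin))  # 5
```

On this toy graph, keeping each triangle together leaves only the bridge edge crossing partitions, whereas the round-robin assignment cuts five of the seven edges.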

Abstract

Graph neural networks (GNNs) have extended the success of deep neural networks (DNNs) to non-Euclidean graph data, achieving ground-breaking performance on various tasks such as node classification and graph property prediction. Nonetheless, existing systems are inefficient at training on large graphs with billions of nodes and edges using GPUs. The main bottlenecks lie in preparing data for GPUs: subgraph sampling and feature retrieving. This paper proposes BGL, a distributed GNN training system designed to address these bottlenecks with a few key ideas. First, we propose a dynamic cache engine to minimize feature retrieving traffic. By co-designing the caching policy and the order of sampling, we find a sweet spot of low overhead and high cache hit ratio. Second, we improve the graph partition algorithm to reduce cross-partition communication during subgraph sampling. Finally,…
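
As a rough illustration of why the order of sampling interacts with the caching policy, the sketch below runs the same four mini-batches through a small feature cache in two different orders and compares the hit ratio. Everything here is an assumption made for the example: the abstract does not specify the eviction policy (FIFO is used only because it is simple), and `FeatureCache`, `fetch_remote`, `run`, and the batch contents are hypothetical names and data, not BGL's API.

```python
from collections import OrderedDict
from typing import Callable, List


class FeatureCache:
    """Fixed-capacity FIFO cache of node features (policy is an assumption)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.store: "OrderedDict[int, List[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, node_id: int, fetch: Callable[[int], List[float]]) -> List[float]:
        if node_id in self.store:
            self.hits += 1
            return self.store[node_id]
        self.misses += 1
        features = fetch(node_id)
        if len(self.store) >= self.capacity:
            self.store.popitem(last=False)  # evict the oldest inserted entry
        self.store[node_id] = features
        return features

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fetch_remote(node_id: int) -> List[float]:
    """Stand-in for an expensive feature fetch from a remote graph store."""
    return [float(node_id)] * 4


def run(batches: List[List[int]], capacity: int) -> float:
    """Feed batches through a fresh cache and return the overall hit ratio."""
    cache = FeatureCache(capacity)
    for batch in batches:
        for node_id in batch:
            cache.get(node_id, fetch_remote)
    return cache.hit_ratio()


# Hypothetical mini-batches: a/b share nodes 2 and 3, c/d share 102 and 103.
a, b = [0, 1, 2, 3], [2, 3, 4, 5]
c, d = [100, 101, 102, 103], [102, 103, 104, 105]

print(run([a, b, c, d], capacity=4))  # 0.25 (overlapping batches are adjacent)
print(run([a, c, b, d], capacity=4))  # 0.0  (overlap is evicted before reuse)
```

With an identical cache and identical batches, scheduling overlapping batches back to back lets shared nodes be reused before eviction, while interleaving unrelated batches loses every potential hit.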

Taxonomy

Topics: Advanced Graph Neural Networks · Graph Theory and Algorithms · Caching and Content Delivery