Legion: Automatically Pushing the Envelope of Multi-GPU System for Billion-Scale GNN Training

Jie Sun; Li Su; Zuocheng Shi; Wenting Shen; Zeke Wang; Lei Wang; Jie Zhang; Yong Li; Wenyuan Yu; Jingren Zhou; Fei Wu

arXiv:2305.16588·cs.DC·June 13, 2023·5 cites

TL;DR

Legion is a multi-GPU system that speeds up billion-scale GNN training on a single machine by combining hierarchical graph partitioning, a unified multi-GPU cache for graph topology and features, and automatic cache management.

Contribution

This work introduces Legion, whose hierarchical graph partitioning improves multi-GPU cache performance and whose unified cache reduces PCIe traffic during billion-scale GNN training.

Findings

01. Supports training billion-scale GNNs on a single machine

02. Outperforms existing cache-based systems on small graphs

03. Achieves higher training throughput across various datasets

Abstract

Graph neural networks (GNNs) have been widely applied in real-world applications, such as product recommendation on e-commerce platforms and risk control in financial management systems. Several cache-based GNN systems have been built to accelerate GNN training on a single machine with multiple GPUs. However, these systems fail to train on billion-scale graphs efficiently, which is a common challenge in industry. In this work, we propose Legion, a system that automatically pushes the envelope of multi-GPU systems for accelerating billion-scale GNN training. First, we design a hierarchical graph partitioning mechanism that significantly improves multi-GPU cache performance. Second, we build a unified multi-GPU cache that minimizes PCIe traffic by caching the graph topology and features with the highest hotness. Third, we develop an automatic caching management…
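To make the idea of a unified, hotness-ranked cache concrete, here is a minimal sketch in Python. It is not Legion's implementation. The `Entry` type, the `fill_unified_cache` function, the greedy fill policy, and all sizes and hotness values are assumptions made for illustration. It shows only that topology and feature entries can compete for one shared memory budget, with the hottest entries admitted first.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    """One cacheable item (hypothetical type, not from the paper)."""
    kind: str        # "topology" (adjacency list) or "feature" (feature vector)
    vertex: int      # vertex ID the entry belongs to
    size_bytes: int  # memory the entry would occupy in the GPU cache
    hotness: int     # assumed access count, e.g. from a sampling pass


def fill_unified_cache(entries: List[Entry], budget_bytes: int) -> Tuple[List[Entry], int]:
    """Greedily admit entries by descending hotness until the budget is used.

    Topology and feature entries share one budget, so a hot adjacency list
    can displace a colder feature vector and vice versa.
    """
    chosen: List[Entry] = []
    used = 0
    for e in sorted(entries, key=lambda e: e.hotness, reverse=True):
        if used + e.size_bytes <= budget_bytes:  # skip entries that do not fit
            chosen.append(e)
            used += e.size_bytes
    return chosen, used


# Illustrative values only; real sizes and hotness depend on the dataset.
entries = [
    Entry("feature", 0, 400, 90),
    Entry("topology", 0, 80, 70),
    Entry("feature", 1, 400, 60),
    Entry("topology", 1, 40, 50),
    Entry("feature", 2, 400, 10),
]
chosen, used = fill_unified_cache(entries, budget_bytes=600)
```

With a 600-byte budget, the sketch admits the hottest feature vector and both adjacency lists, and skips the two colder feature vectors because they no longer fit. Every access served from such a cache is one that does not have to cross PCIe to host memory.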

Code & Models

Repositories

rc4ml/legion (PyTorch, official)

Taxonomy

Topics: Caching and Content Delivery · Advanced Graph Neural Networks · Machine Learning and ELM