# HyScale-GNN: A Scalable Hybrid GNN Training System on Single-Node Heterogeneous Architecture

**Authors:** Yi-Chien Lin, Viktor Prasanna

arXiv: 2303.00158 · 2023-03-02

## TL;DR

HyScale-GNN is a hybrid GNN training system for single-node heterogeneous architectures. It targets large-scale graphs with billions of edges, and its CPU-FPGA design achieves up to 5.27x speedup over multi-node systems such as P3 and DistDGL.

## Contribution

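The paper presents a single-node GNN training system that trains a model collaboratively on processors and accelerators. A two-stage data pre-fetching scheme reduces communication overhead, and a dynamic resource management mechanism adjusts workload assignment and resource allocation at runtime.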

## Key findings

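- The CPU-GPU design achieves up to 2.08x speedup over a multi-GPU baseline implemented in PyTorch-Geometric.
- The CPU-FPGA design achieves up to 12.6x speedup over the same baseline.
- On a single node, the CPU-FPGA design achieves up to 5.27x speedup over multi-node systems such as P3 and DistDGL.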

## Abstract

Graph Neural Networks (GNNs) have shown success in many real-world applications that involve graph-structured data. Most of the existing single-node GNN training systems are capable of training medium-scale graphs with tens of millions of edges; however, scaling them to large-scale graphs with billions of edges remains challenging. In addition, it is challenging to map GNN training algorithms onto a computation node as state-of-the-art machines feature heterogeneous architecture consisting of multiple processors and a variety of accelerators.

We propose HyScale-GNN, a novel system to train GNN models on a single-node heterogeneous architecture. HyScale-GNN performs hybrid training which utilizes both the processors and the accelerators to train a model collaboratively. Our system design overcomes the memory size limitation of existing works and is optimized for training GNNs on large-scale graphs. We propose a two-stage data pre-fetching scheme to reduce the communication overhead during GNN training. To improve task mapping efficiency, we propose a dynamic resource management mechanism, which adjusts the workload assignment and resource allocation during runtime. We evaluate HyScale-GNN on a CPU-GPU and a CPU-FPGA heterogeneous architecture. Using several large-scale datasets and two widely-used GNN models, we compare the performance of our design with a multi-GPU baseline implemented in PyTorch-Geometric. The CPU-GPU design and the CPU-FPGA design achieve up to 2.08x speedup and 12.6x speedup, respectively. Compared with the state-of-the-art large-scale multi-node GNN training systems such as P3 and DistDGL, our CPU-FPGA design achieves up to 5.27x speedup using a single node.
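
The abstract names a two-stage data pre-fetching scheme but does not describe its internals in this summary. The sketch below illustrates only the general idea of overlapping two prefetch stages with training, using bounded queues and one thread per stage. It is not the authors' implementation: `run_pipeline`, `load_features`, `transfer`, `train_step`, and `depth` are hypothetical names, and what each stage does in HyScale-GNN is an assumption here.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the batch stream


def stage(fn, inbox, outbox):
    """Apply fn to each item from inbox and forward the result to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(fn(item))


def run_pipeline(batch_ids, load_features, transfer, train_step, depth=2):
    """Overlap two prefetch stages with training (hypothetical sketch).

    Stage 1 (assumed): load_features gathers a mini-batch in host memory.
    Stage 2 (assumed): transfer moves that mini-batch toward the device.
    The calling thread consumes prefetched batches and runs train_step.
    """
    ids_q = queue.Queue()
    host_q = queue.Queue(maxsize=depth)    # bounded buffer after stage 1
    device_q = queue.Queue(maxsize=depth)  # bounded buffer after stage 2

    workers = [
        threading.Thread(target=stage, args=(load_features, ids_q, host_q)),
        threading.Thread(target=stage, args=(transfer, host_q, device_q)),
    ]
    for w in workers:
        w.start()

    for b in batch_ids:
        ids_q.put(b)
    ids_q.put(_DONE)

    results = []
    while True:
        batch = device_q.get()
        if batch is _DONE:
            break
        results.append(train_step(batch))

    for w in workers:
        w.join()
    return results
```

Because each stage runs in a single thread over FIFO queues, batches reach `train_step` in submission order. The `depth` bound limits how far the prefetch stages can run ahead of training, which caps the memory held by in-flight batches.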

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2303.00158/full.md

## Figures

13 figures with captions in the complete paper: https://tomesphere.com/paper/2303.00158/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/2303.00158/full.md

---
Source: https://tomesphere.com/paper/2303.00158