LSM-GNN: Large-scale Storage-based Multi-GPU GNN Training by Optimizing Data Transfer Scheme
Jeongmin Brian Park, Kun Wu, Vikram Sharma Mailthody, Zaid Qureshi, Scott Mahlke, Wen-mei Hwu

TL;DR
LSM-GNN is a storage-based multi-GPU framework that optimizes data transfer and caching to efficiently train large-scale GNNs, outperforming traditional partitioning methods.
Contribution
The paper introduces a novel communication layer, a hybrid eviction policy, and a prefetching mechanism to reduce overhead and improve performance in multi-GPU GNN training.
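To make the idea of a hybrid eviction policy concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary does not describe how the policy works, so the class name `HybridCache`, the use of node degree as a static score, and the least-recently-used tie-break are all assumptions chosen for illustration. The sketch only shows one plausible way a cache could combine a precomputed (static) signal with an access-order (dynamic) signal when picking a victim.

```python
from collections import OrderedDict


class HybridCache:
    """Toy node-feature cache (hypothetical, not the LSM-GNN design).

    Eviction picks the entry with the lowest static score and breaks
    ties in favor of the least recently used entry.
    """

    def __init__(self, capacity, static_score):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        # Assumed static signal: node_id -> precomputed hotness (e.g. degree).
        self.static_score = static_score
        # Dynamic signal: insertion order tracks recency, oldest first.
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, node_id, load_fn):
        """Return the feature for node_id, calling load_fn on a miss."""
        if node_id in self.entries:
            self.hits += 1
            self.entries.move_to_end(node_id)  # mark as most recently used
            return self.entries[node_id]
        self.misses += 1
        feature = load_fn(node_id)  # stands in for a read from storage
        if len(self.entries) >= self.capacity:
            self._evict()
        self.entries[node_id] = feature
        return feature

    def _evict(self):
        # min() returns the first minimal key; iteration is oldest-first,
        # so equal static scores fall back to least-recently-used order.
        victim = min(self.entries, key=lambda n: self.static_score.get(n, 0))
        del self.entries[victim]
        return victim


# Usage: node 0 has a high static score, so it survives two evictions.
degrees = {0: 100, 1: 2, 2: 2, 3: 50}
cache = HybridCache(capacity=2, static_score=degrees)
for node in [0, 1, 0, 2, 3, 0]:
    cache.get(node, lambda n: [float(n)] * 4)
print(cache.hits, cache.misses, sorted(cache.entries))
```

In this toy trace the high-degree node stays resident while low-degree nodes are evicted, which is the kind of behavior a purely recency-based policy would not guarantee.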
Findings
LSM-GNN achieves up to a 3.75x speedup in end-to-end epoch time over a Dist-DGL baseline.
A single-node, two-GPU LSM-GNN setup outperforms a two-node, four-GPU Dist-DGL configuration.
The framework's hybrid eviction policy and prefetching mechanism manage GPU cache space and data transfers well enough to train large-scale GNNs.
Abstract
Graph Neural Networks (GNNs) are widely used today in recommendation systems, fraud detection, and node/link classification tasks. Real-world GNNs continue to scale in size and require a large memory footprint for storing graphs and embeddings, which often exceeds the memory capacities of the target GPUs used for training. To address limited memory capacities, traditional GNN training approaches use graph partitioning and sharding techniques to scale up across multiple GPUs within a node and/or scale out across multiple nodes. However, this approach suffers from the high computational costs of graph partitioning algorithms and inefficient communication across GPUs. To address these overheads, we propose the Large-scale Storage-based Multi-GPU GNN framework (LSM-GNN), a storage-based approach to training GNN models that utilizes a novel communication layer enabling GPU software caches to…
