Scalable and Adaptive Parallel Training of Graph Transformer on Large Graphs
Jun-Liang Lin, Kamesh Madduri, Mahmut Taylan Kandemir

TL;DR
This paper presents a distributed training framework for graph transformers that adapts to graph and hardware characteristics, significantly improving scalability, speed, and memory efficiency on large graphs.
Contribution
The paper introduces an automatic parallelization strategy and distributed sparse operations that together enable scalable training of graph transformers on large graphs across multiple GPUs.
Findings
Accelerates sparse graph attention by up to 3.8x.
Reduces memory consumption by 78% compared to the state of the art.
Achieves up to 6x speedup on large graph benchmarks with 8 GPUs.
Abstract
Graph foundation models have demonstrated remarkable adaptability across diverse downstream tasks through large-scale pretraining on graphs. However, existing implementations of the backbone model, graph transformers, are typically limited to single-GPU systems, leading to long training times or out-of-memory issues on large graphs. Moreover, parallelizing graph transformer training over the full graph is challenging, as efficiency depends heavily on both the graph structure and system characteristics, such as bandwidth and memory capacity. In this work, we introduce a distributed training framework for graph transformers, which automatically selects and optimizes parallelization strategies based on the graph structure and hardware configuration. With our implementation of distributed sparse operations, we accelerate sparse graph attention by up to 3.8x and reduce memory consumption…
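For readers unfamiliar with the term, the sketch below illustrates what "sparse graph attention" generally means: attention scores are computed only along the edges of the graph, so memory grows with the number of edges rather than with the square of the number of nodes. This is a minimal single-device NumPy illustration, not the authors' distributed implementation; the function name `sparse_graph_attention`, the edge-list layout (`src`, `dst`), and the single-head formulation are all assumptions made for the example.

```python
import numpy as np

def sparse_graph_attention(q, k, v, src, dst):
    """Single-head attention restricted to graph edges (hypothetical helper).

    q, k, v: arrays of shape (num_nodes, dim).
    src, dst: integer arrays of shape (num_edges,); edge i sends a message
    from node src[i] to node dst[i].
    """
    n, d = q.shape
    # One score per edge: destination query dotted with source key.
    scores = np.einsum("ed,ed->e", q[dst], k[src]) / np.sqrt(d)

    # Numerically stable softmax over the incoming edges of each destination.
    max_per_dst = np.full(n, -np.inf)
    np.maximum.at(max_per_dst, dst, scores)
    exp_scores = np.exp(scores - max_per_dst[dst])
    denom = np.zeros(n)
    np.add.at(denom, dst, exp_scores)
    weights = exp_scores / denom[dst]

    # Weighted sum of source values, scattered into each destination row.
    out = np.zeros((n, v.shape[1]), dtype=float)
    np.add.at(out, dst, weights[:, None] * v[src])
    return out

# Toy usage: 5 nodes with self-loops plus a few directed edges.
rng = np.random.default_rng(0)
num_nodes, dim = 5, 4
q = rng.normal(size=(num_nodes, dim))
k = rng.normal(size=(num_nodes, dim))
v = rng.normal(size=(num_nodes, dim))
src = np.array([0, 1, 2, 3, 4, 0, 1, 2, 4])
dst = np.array([0, 1, 2, 3, 4, 1, 2, 3, 0])
out = sparse_graph_attention(q, k, v, src, dst)
```

The result matches dense attention with every non-edge masked out, but the intermediate arrays have one entry per edge instead of one per node pair. Distributing these gather and scatter steps across GPUs is the kind of operation the abstract refers to; how the paper actually partitions them is not shown here.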
