TorchGT: A Holistic System for Large-scale Graph Transformer Training

Meng Zhang; Jie Sun; Qinghao Hu; Peng Sun; Zeke Wang; Yonggang Wen; Tianwei Zhang

arXiv:2407.14106 · cs.DC · July 22, 2024

TL;DR

TorchGT is a training system for large-scale graph transformers. It targets the three obstacles that existing graph transformers hit on massive graphs: heavy computation, limited scalability, and inferior model quality.

Contribution

The paper introduces TorchGT, which optimizes graph transformer training at three levels: Dual-interleaved Attention at the algorithm level, Cluster-aware Graph Parallelism at the runtime level, and Elastic Computation Reformation at the kernel level.

Findings

1. Boosts training speed by up to 62.7x

2. Supports graph sequence lengths up to 1 million

3. Achieves scalable and accurate graph transformer training on large graphs

Abstract

Graph Transformer is a new architecture that surpasses GNNs in graph learning. While there emerge inspiring algorithm advancements, their practical adoption is still limited, particularly on real-world graphs involving up to millions of nodes. We observe existing graph transformers fail on large-scale graphs mainly due to heavy computation, limited scalability and inferior model quality. Motivated by these observations, we propose TorchGT, the first efficient, scalable, and accurate graph transformer training system. TorchGT optimizes training at different levels. At algorithm level, by harnessing the graph sparsity, TorchGT introduces a Dual-interleaved Attention which is computation-efficient and accuracy-maintained. At runtime level, TorchGT scales training across workers with a communication-light Cluster-aware Graph Parallelism. At kernel level, an Elastic Computation Reformation…
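
The abstract says TorchGT harnesses graph sparsity at the algorithm level but, as excerpted here, does not spell out how Dual-interleaved Attention works. The sketch below is therefore not TorchGT's method. It is a generic illustration, under our own assumptions, of why restricting attention to graph edges reduces work: scores are computed per edge (E of them) rather than per node pair (n squared). The function names `edge_attention` and `masked_dense_attention`, the ring graph, and all sizes are hypothetical choices made for this example.

```python
import numpy as np


def masked_dense_attention(q, k, v, adj):
    """Reference: full n x n scores, then mask out non-edges.

    adj is a boolean (n, n) matrix; adj[i, j] is True when node i may
    attend to node j. Every row needs at least one True entry.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(adj, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # masked entries become exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def edge_attention(q, k, v, src, dst):
    """Same result, but only one score per edge (src[e] attends to dst[e])."""
    n, d = q.shape
    # One dot product per edge instead of one per node pair.
    scores = np.einsum("ed,ed->e", q[src], k[dst]) / np.sqrt(d)

    # Per-source-node softmax, done with unbuffered scatter operations.
    row_max = np.full(n, -np.inf)
    np.maximum.at(row_max, src, scores)
    w = np.exp(scores - row_max[src])
    denom = np.zeros(n)
    np.add.at(denom, src, w)

    # Weighted sum of neighbor values, accumulated per source node.
    out = np.zeros((n, v.shape[1]))
    np.add.at(out, src, (w / denom[src])[:, None] * v[dst])
    return out


# Hypothetical toy graph: a 6-node ring with self-loops (3 edges per node).
rng = np.random.default_rng(0)
n, d = 6, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

idx = np.arange(n)
src = np.concatenate([idx, idx, idx])
dst = np.concatenate([idx, (idx + 1) % n, (idx - 1) % n])

adj = np.zeros((n, n), dtype=bool)
adj[src, dst] = True

out_edges = edge_attention(q, k, v, src, dst)
out_dense = masked_dense_attention(q, k, v, adj)
num_edge_scores = len(src)   # E
num_dense_scores = n * n     # n^2
```

On this toy graph both functions produce the same output, while the edge-based version evaluates 18 scores instead of 36. The gap widens on real graphs, where the number of edges is typically far smaller than the square of the number of nodes. A production system would rely on fused GPU kernels rather than NumPy scatter operations.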

Taxonomy

Methods: Attention Is All You Need · Laplacian EigenMap · Residual Connection · Adam · Dropout · Laplacian Positional Encodings · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer