TorchGT: A Holistic System for Large-scale Graph Transformer Training
Meng Zhang, Jie Sun, Qinghao Hu, Peng Sun, Zeke Wang, Yonggang Wen, Tianwei Zhang

TL;DR
TorchGT is a comprehensive system that enables efficient, scalable, and accurate training of large-scale graph transformers, overcoming previous limitations in computation, scalability, and quality on massive graphs.
Contribution
The paper introduces TorchGT, a system that applies optimizations at three levels for large-scale graph transformer training: Dual-interleaved Attention (algorithm), Cluster-aware Graph Parallelism (runtime), and Elastic Computation Reformation (kernel).
Findings
Boosts training speed by up to 62.7x
Supports graph sequence lengths up to 1 million
Achieves scalable and accurate graph transformer training on large graphs
Abstract
Graph Transformer is a new architecture that surpasses GNNs in graph learning. While there emerge inspiring algorithm advancements, their practical adoption is still limited, particularly on real-world graphs involving up to millions of nodes. We observe existing graph transformers fail on large-scale graphs mainly due to heavy computation, limited scalability and inferior model quality. Motivated by these observations, we propose TorchGT, the first efficient, scalable, and accurate graph transformer training system. TorchGT optimizes training at different levels. At algorithm level, by harnessing the graph sparsity, TorchGT introduces a Dual-interleaved Attention which is computation-efficient and accuracy-maintained. At runtime level, TorchGT scales training across workers with a communication-light Cluster-aware Graph Parallelism. At kernel level, an Elastic Computation Reformation…
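The abstract's algorithm-level idea of harnessing graph sparsity can be illustrated with a minimal sketch. This is not TorchGT's Dual-interleaved Attention, whose details are not given here. It only contrasts dense attention, where every node scores every other node, with attention restricted to graph neighbors. The function names (`dense_attention`, `sparse_attention`, `scored_pairs`) and the adjacency-list input are assumptions made for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q_i, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(q_i)
    scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in keys]
    weights = softmax(scores)
    return [sum(w * v_j[c] for w, v_j in zip(weights, values))
            for c in range(len(values[0]))]

def dense_attention(q, k, v):
    # Every node attends to every node: n * n scored pairs.
    return [attend(q_i, k, v) for q_i in q]

def sparse_attention(q, k, v, neighbors):
    # Hypothetical topology-restricted variant: node i attends only to
    # itself and the nodes listed in neighbors[i].
    out = []
    for i, q_i in enumerate(q):
        idx = sorted(set(neighbors[i]) | {i})
        out.append(attend(q_i, [k[j] for j in idx], [v[j] for j in idx]))
    return out

def scored_pairs(neighbors):
    # Number of query-key scores the sparse variant computes.
    return sum(len(set(nbrs) | {i}) for i, nbrs in enumerate(neighbors))

# A 4-node path graph 0-1-2-3 as an adjacency list.
path = [[1], [0, 2], [1, 3], [2]]
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
sparse_out = sparse_attention(x, x, x, path)
print(scored_pairs(path), "scored pairs instead of", len(x) ** 2)
```

On a sparse graph the number of scored pairs grows with the number of edges rather than with the square of the number of nodes, which is the general reason sparsity helps at long sequence lengths. When the adjacency list describes a complete graph, the sparse variant reduces to dense attention.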
Taxonomy
Methods: Attention Is All You Need · Laplacian EigenMap · Residual Connection · Adam · Dropout · Laplacian Positional Encodings · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer
