Graph Transformers for Large Graphs
Vijay Prakash Dwivedi, Yozen Liu, Anh Tuan Luu, Xavier Bresson, Neil Shah, Tong Zhao

TL;DR
This paper introduces LargeGT, a scalable graph transformer architecture that efficiently handles large graphs by combining local sampling with global attention, achieving significant speedups and performance improvements on large benchmarks.
Contribution
The work presents a novel scalable graph transformer with a fast neighborhood sampling method and a hybrid local-global attention mechanism for large-scale graph learning.
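The sampling idea can be illustrated with a minimal sketch. This is not the paper's Algorithm 1: the function name `sample_hop_neighbors`, the adjacency-dict input, and the fallback for isolated nodes are all assumptions made for illustration. It draws K nodes from a node's 1-hop neighbors and K from its 2-hop neighbors, sampling without replacement when enough candidates exist and with replacement otherwise.

```python
import random


def sample_hop_neighbors(adj, node, k, rng):
    """Sample k nodes from the 1-hop and k from the 2-hop neighborhood of `node`.

    `adj` maps each node to a list of its neighbors (hypothetical input format).
    Returns a pair of lists (one_hop_sample, two_hop_sample), each of length k.
    """
    one_hop = sorted(adj[node])
    # Neighbors of neighbors, excluding the node itself. A node can appear in
    # both the 1-hop and the 2-hop candidate lists.
    two_hop = sorted({w for u in one_hop for w in adj[u]} - {node})

    def pick(candidates):
        if not candidates:
            # Assumption: an isolated node falls back to repeating itself.
            return [node] * k
        if len(candidates) >= k:
            # Enough candidates: sample without replacement.
            return rng.sample(candidates, k)
        # Too few candidates: sample with replacement to reach size k.
        return [rng.choice(candidates) for _ in range(k)]

    return pick(one_hop), pick(two_hop)


# Toy graph, for illustration only.
toy_adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
one_hop_sample, two_hop_sample = sample_hop_neighbors(toy_adj, 0, 3, random.Random(0))
print(one_hop_sample, two_hop_sample)
```

Because every node yields a fixed-size sample, the per-node input has a constant shape regardless of degree, which is what makes batching straightforward in a sketch like this.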
Findings
3x speedup on large graph benchmarks
16.8% performance gain on ogbn-products and snap-patents
Scales to graphs with millions of nodes
Abstract
Transformers have recently emerged as powerful neural networks for graph learning, showcasing state-of-the-art performance on several graph property prediction tasks. However, these results have been limited to small-scale graphs, where the global attention mechanism is computationally feasible. The next goal is to scale up these architectures to handle very large graphs on the scale of millions or even billions of nodes. With large-scale graphs, global attention learning proves impractical because its complexity is quadratic in the number of nodes. On the other hand, neighborhood sampling techniques become essential to manage large graph sizes, yet finding the optimal trade-off between speed and accuracy with sampling techniques remains challenging. This work advances representation learning on single large-scale graphs with a focus on identifying model…
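To make the quadratic cost concrete, the following back-of-the-envelope sketch compares the memory needed for one dense attention score matrix under global attention with the memory needed when each node attends to a fixed-size sampled context. The graph size of one million nodes, the context size K = 100, and the 4-byte score are hypothetical values chosen for illustration, not figures from the paper.

```python
def attention_score_bytes(num_queries, num_keys, bytes_per_score=4):
    """Bytes needed to store one dense attention score matrix (one head, one layer)."""
    return num_queries * num_keys * bytes_per_score


num_nodes = 1_000_000  # hypothetical graph size
context_size = 100     # hypothetical number of sampled nodes per query node (K)

# Global attention: every node attends to every node, so N * N scores.
full_bytes = attention_score_bytes(num_nodes, num_nodes)
# Sampled attention: every node attends to K nodes, so N * K scores.
sampled_bytes = attention_score_bytes(num_nodes, context_size)

print(f"global attention:  {full_bytes / 1e12:.1f} TB")    # 4.0 TB
print(f"sampled attention: {sampled_bytes / 1e6:.1f} MB")  # 400.0 MB
print(f"ratio: {full_bytes // sampled_bytes}x")            # 10000x
```

The ratio equals N / K, so the gap widens linearly as the graph grows while K stays fixed.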
Peer Reviews
Decision: Submitted to ICLR 2024
1. The proposed neighbor sampling intuitively improves model accuracy by gathering information from up to 4 hops away.
2. Extensive experiments are performed.
3. The proposed method is described very clearly.
1. It is unclear why LargeGT runs faster than baselines such as GOAT. Since the proposed neighbor sampling has a bigger input matrix than a simple 2-hop neighbor sampling method, does it run longer than the traditional method?
2. The runtime depends heavily on the hyperparameter $K$, the number of nodes sampled. The authors need to provide a fair and solid comparison with the traditional 2-hop neighbor sampling method.
3. Experiment performances are not explained well (see qu
* The model's performance is thoroughly validated on large-scale graphs, demonstrating sufficient workload.
* Exploring base model architectures on graphs is a very valuable endeavor.
* The efficiency analysis is incorrect. Algorithm 1 must gather the first- and second-degree neighbors of each node and then select K nodes. Selecting the nodes is O(K), but if the graph is relatively dense, the complexity of gathering second-degree neighbors is O(N^2).
* In Algorithm 1, some nodes are sampled with replacement, while others are sampled without replacement. It is uncertain whether this introduces bias in the sampling.
* It lacks some key baselines such as SGC[1], SI
- The authors propose a framework that leverages recent advances in graph transformer models and addresses a critical challenge that limits the scalability of existing approaches, both MPNNs and GTs.
- The introduction provides a great overview of the current challenges for large-scale graph learning and does a great job of comparing MPNNs and GTs, while setting the stage for key concepts like neighborhood sampling.
1. Baselines: LargeGT is compared to "constrained versions" of various baselines, notably all models are constrained to 2 hops only, while LargeGT has access to 4-hops worth of neighbors (in the local module). Including the non-constrained versions of these same baselines is critical for evaluation, even if they are more computationally demanding. Currently it is unclear whether adopting LargeGT leads to lower performance compared to state-of-the-art methods, at the expense of computational effi
Taxonomy
Topics: Advanced Graph Neural Networks · Recommender Systems and Techniques · Smart Cities and Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam · Residual Connection · Layer Normalization
