VCR-Graphormer: A Mini-batch Graph Transformer via Virtual Connections
Dongqi Fu, Zhigang Hua, Yan Xie, Jin Fang, Si Zhang, Kaan Sancak, Hao Wu, Andrey Malevich, Jingrui He, Bo Long

TL;DR
VCR-Graphormer introduces a mini-batch graph transformer that uses personalized PageRank tokenization and virtual connections to efficiently encode complex graph information, enabling scalable and expressive graph learning.
Contribution
The paper proposes a novel mini-batch training method for graph transformers using PPR tokenization and virtual connections, improving scalability and representation capacity.
Findings
Enables effective mini-batch training of graph transformers.
Decouples feature engineering from model training.
Supports local, global, and heterophilous information encoding.
Abstract
Graph transformer has been proven as an effective graph learning method for its adoption of attention mechanism that is capable of capturing expressive representations from complex topological and feature information of graphs. Graph transformer conventionally performs dense attention (or global attention) for every pair of nodes to learn node representation vectors, resulting in quadratic computational costs that are unaffordable for large-scale graph data. Therefore, mini-batch training for graph transformers is a promising direction, but limited samples in each mini-batch cannot support effective dense attention to encode informative representations. Facing this bottleneck, (1) we start by assigning each node a token list that is sampled by personalized PageRank (PPR) and then apply standard multi-head self-attention only on this list to compute its node representations. This PPR…
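The tokenization step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`ppr_vector`, `ppr_token_list`, `self_attention`), the restart probability `alpha=0.15`, the top-k selection rule, and the single-head attention with identity projections are all assumptions made for the example. The paper uses standard multi-head self-attention with learned projections, and its exact sampling procedure may differ.

```python
import numpy as np

def ppr_vector(adj, source, alpha=0.15, iters=100):
    """Approximate the personalized PageRank vector of `source` by power iteration."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                 # guard against isolated nodes
    trans = adj / deg                   # row-stochastic transition matrix
    restart = np.zeros(n)
    restart[source] = 1.0               # all restart mass on the target node
    r = restart.copy()
    for _ in range(iters):
        r = alpha * restart + (1.0 - alpha) * trans.T @ r
    return r

def ppr_token_list(adj, source, k, alpha=0.15):
    """Token list for `source`: the node itself plus its k highest-PPR nodes (assumed rule)."""
    r = ppr_vector(adj, source, alpha)
    r[source] = -np.inf                 # rank only the other nodes
    top = np.argsort(-r)[:k]
    return [source] + top.tolist()

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections (simplification)."""
    d = tokens.shape[1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ tokens

# Toy example: a 5-node path graph 0-1-2-3-4 with random 4-dimensional features.
adj = np.zeros((5, 5))
for i in range(4):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
features = np.random.default_rng(0).normal(size=(5, 4))

tokens = ppr_token_list(adj, source=0, k=2)   # [0, 1, 2] on this path graph
out = self_attention(features[tokens])        # attention over 3 tokens, not all 5 nodes
node_repr = out[0]                            # row for the target node
```

Because each token list depends only on the graph and the target node, the lists can be precomputed once and the nodes then batched independently. Attention cost per node scales with the list length rather than with the number of nodes in the graph.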
Peer Reviews
Decision: ICLR 2024 poster
The scalable graph transformer is a hot, interesting, and important field in our research community. Training with mini-batches is memory-efficient by nature. The experiments are extensive (but focus on homophilous datasets). The proposed method outperforms baselines by a large margin, especially for heterophilous graphs.
- The paper is hard to follow. The abuse of notation in Eq. 3.1, 3.2, and 3.3 might confuse readers. Please formally define the operations or use well-known ones. Is {.} a set or a list? Please use proper set-builder notation. Could l (the l-th step) be added to the name of the variable r_u? Is the concatenation operator applied to both scalars and vectors? The cardinality of T_u in Eq. 3.3 is 4 when it is written like this. Plus, it would be nice if the authors explained what insights we can see
- The presentation is easy to follow and the intuition of the approach is well supported.
- The types and number of datasets and baseline models are adequate for demonstrating the efficacy of the approach, for the node classification task in particular. Ablation studies are very informative (Figures 3 and 4).
- The inclusion of additional graph learning tasks (edge/graph classification) would further establish the validity/generality of the token list preparation approach proposed.
1. This paper summarizes four metrics for graph tokenization methods.
2. This paper leverages existing techniques, like PPR and the graph partition method, to generate the token list for each target node.
3. This paper provides several theoretical analyses for the proposed method.
4. Empirical results on datasets of different scales seem to indicate the promising performance of the proposed method.

1. Several recent studies on designing graph Transformers with node sampling or node clustering are ignored.
2. Experimental results are insufficient to demonstrate the merits of the proposed method.
Taxonomy
Topics: Interconnection Networks and Systems · Graph Theory and Algorithms
Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Absolute Position Encodings · Laplacian EigenMap · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Softmax
