What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding
Hongkang Li, Meng Wang, Tengfei Ma, Sijia Liu, Zaixi Zhang, Pin-Yu Chen

TL;DR
This paper provides a theoretical analysis of shallow Graph Transformers, revealing how self-attention and positional encoding improve generalization by promoting sparsity and core neighborhoods, supported by empirical validation.
Contribution
It offers the first theoretical characterization of sample complexity and convergence for shallow Graph Transformers with self-attention and positional encoding.
Findings
Self-attention and positional encoding promote sparsity in attention maps.
The theoretical sample complexity depends on the fraction of discriminative nodes.
Empirical results validate the theoretical insights on benchmarks.
Abstract
Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper…
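The architecture described in the abstract (one self-attention layer with relative positional encoding, followed by a two-layer perceptron) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions, not the paper's exact parameterization: the choice to index the positional bias `b` by hop distance, the truncation at `max_hops`, and all names (`hop_distances`, `shallow_graph_transformer`, `W_q`, `W_k`, `W_v`, `W_1`, `w_2`) are hypothetical.

```python
import numpy as np

def hop_distances(adj, max_hops):
    """Shortest-path hop distance between nodes, capped at max_hops + 1."""
    n = adj.shape[0]
    dist = np.full((n, n), max_hops + 1, dtype=int)
    np.fill_diagonal(dist, 0)
    reach = np.eye(n, dtype=bool)      # pairs already assigned a distance
    frontier = np.eye(n, dtype=int)    # pairs joined by a walk of length k
    for k in range(1, max_hops + 1):
        frontier = ((frontier @ adj) > 0).astype(int)
        new = (frontier > 0) & ~reach  # first reached at exactly k hops
        dist[new] = k
        reach |= new
    return dist

def masked_softmax(scores, mask):
    """Row-wise softmax restricted to entries where mask is True."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

def shallow_graph_transformer(X, adj, W_q, W_k, W_v, b, W_1, w_2, max_hops):
    """One attention layer with a hop-distance bias, then a two-layer MLP.

    Assumption: the relative positional encoding is a learnable scalar
    b[k] added to the attention score of node pairs at hop distance k.
    """
    dist = hop_distances(adj, max_hops)
    mask = dist <= max_hops            # attend only within max_hops
    scores = (X @ W_q) @ (X @ W_k).T + b[np.minimum(dist, max_hops)]
    attn = masked_softmax(scores, mask)
    H = attn @ (X @ W_v)               # aggregated node representations
    hidden = np.maximum(H @ W_1, 0.0)  # ReLU hidden layer
    return hidden @ w_2, attn          # one logit per node, attention map

# Usage on a 4-node path graph with random (untrained) weights.
rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])
X = rng.normal(size=(4, 3))
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
b = np.zeros(2)                        # biases for hop distances 0 and 1
W_1, w_2 = rng.normal(size=(3, 5)), rng.normal(size=5)
logits, attn = shallow_graph_transformer(
    X, adj, W_q, W_k, W_v, b, W_1, w_2, max_hops=1)
```

In this sketch the attention map is zero outside the chosen hop radius, and the bias `b` lets training reweight neighbors by distance, which is one simple way to picture how positional encoding could steer attention toward a core neighborhood.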
Taxonomy
Topics: Data Visualization and Analytics · Cognitive Science and Mapping · Constraint Satisfaction and Optimization
Methods: Attention Is All You Need · Laplacian EigenMap · Laplacian Positional Encodings · Softmax · Layer Normalization · Graph Transformer · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam
