What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding

Hongkang Li; Meng Wang; Tengfei Ma; Sijia Liu; Zaixi Zhang; Pin-Yu Chen

arXiv:2406.01977·cs.LG·June 5, 2024·1 cites

TL;DR

This paper gives a theoretical analysis of a shallow Graph Transformer, showing that self-attention and positional encoding improve generalization by making the attention map sparse and promoting the core neighborhood during training. Experiments support the analysis.

Contribution

It offers the first theoretical characterization of sample complexity and convergence for shallow Graph Transformers with self-attention and positional encoding.

Findings

01 Self-attention and positional encoding promote sparsity in attention maps.

02 The theoretical sample complexity depends on the fraction of discriminative nodes.

03 Empirical results on benchmarks validate the theoretical insights.

Abstract

Graph Transformers, which incorporate self-attention and positional encoding, have recently emerged as a powerful architecture for various graph learning tasks. Despite their impressive performance, the complex non-convex interactions across layers and the recursive graph structure have made it challenging to establish a theoretical foundation for learning and generalization. This study introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised node classification, comprising a self-attention layer with relative positional encoding and a two-layer perceptron. Focusing on a graph data model with discriminative nodes that determine node labels and non-discriminative nodes that are class-irrelevant, we characterize the sample complexity required to achieve a desirable generalization error by training with stochastic gradient descent (SGD). This paper…
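The architecture named in the abstract (one self-attention layer with relative positional encoding, followed by a two-layer perceptron) can be illustrated with a minimal sketch. This is not the authors' code. The function `shallow_graph_transformer`, the weight names `W_Q`, `W_K`, `W_V`, `W_O`, `a`, the choice of a learnable scalar bias `b` per shortest-path-distance bucket as the positional encoding, and the toy path graph are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def shallow_graph_transformer(X, dist, n, W_Q, W_K, W_V, W_O, a, b):
    """Return (score, attention weights) for query node n.

    X    : list of node feature vectors
    dist : dist[n][s] is the shortest-path distance between n and s
           (assumed precomputed; hypothetical input format)
    b    : one scalar bias per distance bucket; larger distances share
           the last bucket (an assumed form of relative positional encoding)
    """
    q = matvec(W_Q, X[n])
    scores = []
    for s in range(len(X)):
        k = matvec(W_K, X[s])
        bucket = min(dist[n][s], len(b) - 1)
        scores.append(dot(k, q) + b[bucket])  # content term + positional term
    attn = softmax(scores)

    # Attention-weighted sum of value vectors.
    agg = [0.0] * len(W_V)
    for s, weight in enumerate(attn):
        v = matvec(W_V, X[s])
        for i in range(len(agg)):
            agg[i] += weight * v[i]

    # Two-layer perceptron: hidden ReLU layer W_O, then linear readout a.
    hidden = [max(0.0, h) for h in matvec(W_O, agg)]
    return dot(a, hidden), attn


# Toy example: a path graph 0 - 1 - 2 - 3 with 3-dimensional features.
random.seed(0)
d = 3
X = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
dist = [[0, 1, 2, 3], [1, 0, 1, 2], [2, 1, 0, 1], [3, 2, 1, 0]]


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


zero = [[0.0] * d for _ in range(d)]  # switch off the content term
W_V, W_O = rand_matrix(d, d), rand_matrix(4, d)
a = [1.0, -1.0, 1.0, -1.0]
b = [5.0, 5.0, -5.0]  # favour distance 0 and 1, penalise distance >= 2

score, attn = shallow_graph_transformer(X, dist, 0, zero, zero, W_V, W_O, a, b)
```

With the query and key weights set to zero, the attention weights are driven entirely by the positional bias, so nearly all of node 0's attention falls on itself and its one-hop neighbor. This gives a hand-built picture of a sparse attention map concentrated on a small neighborhood; in the paper the corresponding behavior is a result of training, not of hand-set weights.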

Taxonomy

Topics: Data Visualization and Analytics · Cognitive Science and Mapping · Constraint Satisfaction and Optimization

Methods: Attention Is All You Need · Laplacian EigenMap · Laplacian Positional Encodings · Softmax · Layer Normalization · Graph Transformer · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam