Graph Transformers Dream of Electric Flow

Xiang Cheng; Lawrence Carin; Suvrit Sra

arXiv:2410.16699·cs.LG·March 4, 2025

Open Access · 3 Reviews

TL;DR

This paper demonstrates that linear Transformers can theoretically and empirically implement classical graph algorithms like electric flow and eigenvector decomposition, and can learn effective positional encodings in real-world tasks.

Contribution

It provides explicit configurations for implementing graph algorithms with linear Transformers and analyzes their errors, advancing understanding of Transformers on graph data.

Findings

01

Linear Transformers can implement electric flow and eigenvector algorithms.

02

Transformers learn more effective positional encodings than Laplacian eigenvectors.

03

Experimental results support theoretical claims on synthetic and real-world data.
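For context on the second finding, the baseline it refers to is a positional encoding built from Laplacian eigenvectors. The following is a minimal sketch of that baseline, not the paper's code: it assumes a connected, unweighted graph, an incidence matrix `B` of shape (nodes, edges), and a hypothetical helper name `laplacian_eigenvector_pe`.

```python
import numpy as np

def laplacian_eigenvector_pe(B, k):
    """Return the k smallest nontrivial Laplacian eigenpairs as a positional encoding.

    B is a signed node-by-edge incidence matrix (assumed layout), so the
    unweighted graph Laplacian is L = B @ B.T.
    """
    L = B @ B.T
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(L)
    # Skip index 0: on a connected graph it is the constant eigenvector (eigenvalue 0).
    return eigvals[1:k + 1], eigvecs[:, 1:k + 1]

# Path graph 0 - 1 - 2 - 3, one column per edge.
edges = [(0, 1), (1, 2), (2, 3)]
B = np.zeros((4, len(edges)))
for j, (u, v) in enumerate(edges):
    B[u, j], B[v, j] = 1.0, -1.0

vals, pe = laplacian_eigenvector_pe(B, k=2)
# Each row of `pe` is a 2-dimensional encoding for one node.
```

Eigenvectors are only defined up to sign (and up to rotation when eigenvalues repeat), which is one reason a fixed encoding of this kind can be awkward to use as a model input.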

Abstract

We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The Transformer has access to information on the input graph only via the graph's incidence matrix. We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers' errors by the errors of the underlying algorithms. Our theoretical findings are corroborated by experiments on synthetic data. Additionally, on a real-world molecular regression task, we observe that the linear Transformer is capable of learning a more effective positional encoding than the default one based on Laplacian eigenvectors. Our work is an initial step towards elucidating the inner-workings of the Transformer for graph data. Code is available at…
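To make the electric flow problem concrete, here is a minimal closed-form reference computation from an incidence matrix. It is a sketch under stated assumptions and not the Transformer construction from the paper: the (nodes, edges) layout of `B`, the sign convention, and the helper names `incidence_matrix` and `electric_flow` are all illustrative choices. It uses the standard identities L = B R⁻¹ Bᵀ for the Laplacian, potentials φ = L⁺ψ, and flow f = R⁻¹ Bᵀ φ.

```python
import numpy as np

def incidence_matrix(n_nodes, edges):
    """Signed incidence matrix: column j has +1 at the tail and -1 at the head of edge j."""
    B = np.zeros((n_nodes, len(edges)))
    for j, (u, v) in enumerate(edges):
        B[u, j] = 1.0
        B[v, j] = -1.0
    return B

def electric_flow(B, demand, resistances=None):
    """Solve for node potentials and edge currents given a zero-sum demand vector."""
    n_edges = B.shape[1]
    r = np.ones(n_edges) if resistances is None else np.asarray(resistances, dtype=float)
    L = B @ np.diag(1.0 / r) @ B.T          # weighted graph Laplacian
    potentials = np.linalg.pinv(L) @ demand  # pseudo-inverse, since L is singular
    flow = (B.T @ potentials) / r            # Ohm's law on each edge
    return potentials, flow

# Triangle with unit resistances; inject one unit of current at node 0, extract at node 2.
B = incidence_matrix(3, [(0, 1), (1, 2), (0, 2)])
demand = np.array([1.0, 0.0, -1.0])
potentials, flow = electric_flow(B, demand)
# flow is approximately [1/3, 1/3, 2/3]: the direct edge carries twice the
# current of the two-edge detour, which has twice the resistance.
```

The check `B @ flow == demand` (up to rounding) confirms that current is conserved at every node.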

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 5 · Confidence 4

Strengths

- **Originality**: This paper introduces a novel use of a linear Transformer: performing core graph algorithms such as electric flow and eigenvector decomposition without explicit positional encodings.
- **Quality**: The paper offers rigorous theoretical analysis with explicit weight constructions and error bounds for each algorithm.
- **Clarity**: The paper is well-organized, clearly written, and enjoyable to read.

Weaknesses

1. Since the paper’s contribution is primarily theoretical, providing a proof sketch under the main results would be highly beneficial. Additionally, a more detailed description of the weight matrices would enhance clarity.
2. The theoretical results are not particularly surprising given the use of a linear Transformer. Could these results also apply to GNNs?
3. The practical impact of the proposed approach is unclear, as the empirical results are limited compared to numerous existing works. F

Reviewer 02 · Rating 6 · Confidence 4

Strengths

- The idea is novel and appealing: using a Transformer to compute the solution to a graph problem as part of the latent vectors in its node output representations extends the applicability of Transformers well beyond language understanding or generation.
- The lemmas are well organized and follow similar themes, and the narrative is smooth and clear.

Weaknesses

- Removing the nonlinear softmax terms from the standard Transformer architecture facilitates analysis but severely limits the power of the model.
- The complexity of the approach is prohibitive: it can be O(n^4), which explains the experimentation with very small synthetic graphs. The parameter-efficient implementation is promising, but the original idea is still far from scalable and thus cannot be tested in practice beyond a few dozen graph nodes.
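The first point concerns the difference between softmax attention and attention with the softmax removed. The sketch below contrasts the two in a generic single-head form with a residual connection. It is an illustration under assumptions, not the paper's parameterization: the weight names, the row-per-node layout of `Z`, and the omission of the usual scaling and normalization are choices made here for brevity.

```python
import numpy as np

def softmax_attention(Z, W_q, W_k, W_v):
    """Single-head softmax attention with a residual connection (no score scaling)."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    scores = Q @ K.T
    scores = scores - scores.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # each row sums to 1
    return Z + weights @ V

def linear_attention(Z, W_q, W_k, W_v):
    """Same layer with the softmax removed: the update is a polynomial in Z."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    return Z + (Q @ K.T) @ V

rng = np.random.default_rng(0)
Z = rng.normal(size=(5, 3))                     # 5 tokens, 3 features
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
out = linear_attention(Z, W_q, W_k, W_v)
```

Without the softmax, the product can be regrouped as `Q @ (K.T @ V)` by associativity, which is part of what makes the linear variant easier to analyze as a sequence of matrix operations.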

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The results are interesting. I particularly appreciated the structure of the paper, in which experimental results are shown after their corresponding theoretical claims instead of all at the end; this highlights the connection between the presented theory and the experimental results.

Weaknesses

W1. Generally speaking, I think the impact of the paper is somewhat limited. This is especially true because the considered architecture (linear Transformers), and in particular its variant that includes an L2 normalization, is not widely used. The main implication I can see is using this variant of the linear Transformer in place of existing predefined positional encodings, and therefore as an additional component of a bigger architecture.
W2. I find it a bit confusing that the authors cl

Taxonomy

Topics: Machine Learning in Materials Science · Graph Theory and Algorithms · Neural Networks and Applications

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Residual Connection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Multi-Head Attention · Adam · Dropout