Linear Transformer Topological Masking with Graph Random Features

Isaac Reid; Kumar Avinava Dubey; Deepali Jain; Will Whitney; Amr Ahmed; Joshua Ainslie; Alex Bewley; Mithun Jacob; Aranyak Mehta; David Rendleman; Connor Schenck; Richard E. Turner; René Wagner; Adrian Weller; Krzysztof Choromanski

arXiv:2410.03462·cs.LG·October 16, 2024

TL;DR

This paper introduces a learnable, graph-structured topological masking method for linear transformers, leveraging graph random features to efficiently incorporate graph topology with strong theoretical guarantees, enabling scalable processing of large graph data.

Contribution

It proposes a novel parameterization of topological masks using graph random features, compatible with linear attention and providing theoretical concentration bounds.
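The compatibility with linear attention can be illustrated with a minimal NumPy sketch. It assumes the mask factorises as $M = \Phi \Psi^\top$ and that the query/key feature maps are positive; the function names, shapes, and dense feature matrices are illustrative assumptions, not the authors' implementation. The identity used is $M_{ij}\,(q_i^\top k_j) = (\phi_i \otimes q_i)^\top (\psi_j \otimes k_j)$, so the masked attention matrix never has to be formed.

```python
import numpy as np

def masked_attention_quadratic(Qf, Kf, V, M):
    """Reference O(N^2) version: A = M * (Qf Kf^T), then row-normalise."""
    A = M * (Qf @ Kf.T)
    return (A @ V) / A.sum(axis=1, keepdims=True)

def masked_attention_linear(Qf, Kf, V, Phi, Psi):
    """Same output without the N x N matrix, assuming M = Phi @ Psi.T.

    Qf, Kf: (N, m) positive query/key features (assumed, e.g. from a kernel map)
    Phi, Psi: (N, r) mask features (hypothetical names)
    V: (N, d) values
    """
    N = Qf.shape[0]
    # Row-wise outer products, flattened to shape (N, r * m).
    Qt = np.einsum("nr,nm->nrm", Phi, Qf).reshape(N, -1)
    Kt = np.einsum("nr,nm->nrm", Psi, Kf).reshape(N, -1)
    KV = Kt.T @ V        # (r * m, d), one pass over the keys
    Z = Kt.sum(axis=0)   # (r * m,), used for the row normaliser
    return (Qt @ KV) / (Qt @ Z)[:, None]

rng = np.random.default_rng(0)
N, m, r, d = 6, 4, 3, 5
Qf, Kf = rng.random((N, m)), rng.random((N, m))
Phi, Psi = rng.random((N, r)), rng.random((N, r))
V = rng.standard_normal((N, d))
out = masked_attention_linear(Qf, Kf, V, Phi, Psi)
```

With dense features the cost grows with $r \cdot m$ per token, so this route is only cheap when each row of the mask features has few non-zero entries; that is why the sparsity of the features matters for the overall complexity.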

Findings

01

Achieves $O(N)$ time and space complexity in the number of input tokens $N$, making large graphs tractable.

02

Provides strong performance improvements on image and point cloud tasks.

03

Supports graphs with over 30,000 nodes.

Abstract

When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between the query and keys in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix -- a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving $O(N)$ time and space complexity with respect to the number of input tokens. The fastest previous alternative was $O(N \log N)$ and only suitable for specific…
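As a rough illustration of how random features for a function of a weighted adjacency matrix can be sampled, the sketch below uses one common random-walk construction with importance weighting. Whether the paper uses exactly this scheme is not stated on this page. The function name, the modulation function `f`, the halting probability `p_halt`, and the dense `(N, N)` output are all assumptions made for readability. Under these assumptions, row $i$ of the output is an unbiased estimate of $\sum_k f(k)\,(W^k)_{i,:}$, and a product of two independent feature matrices gives a low-rank estimate of a power series in $W$.

```python
import numpy as np

def graph_random_features(W, f, p_halt=0.5, n_walks=100, seed=0):
    """Random-walk features for a weighted adjacency matrix W (illustrative sketch).

    Each walk starts at node i, deposits load * f(step) at the node it visits,
    halts with probability p_halt, and otherwise moves to a uniformly chosen
    neighbour while reweighting its load to keep the estimate unbiased.
    """
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    Phi = np.zeros((N, N))  # dense only for clarity; a real version would be sparse
    for i in range(N):
        for _ in range(n_walks):
            node, load, step = i, 1.0, 0
            while True:
                Phi[i, node] += load * f(step)
                nbrs = np.flatnonzero(W[node])
                if nbrs.size == 0 or rng.random() < p_halt:
                    break
                nxt = rng.choice(nbrs)
                # Importance weight: edge weight divided by the step probability.
                load *= nbrs.size * W[node, nxt] / (1.0 - p_halt)
                node, step = nxt, step + 1
    return Phi / n_walks

# Path graph 0 - 1 - 2 with edge weight 0.3 (toy example).
W = 0.3 * np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
Phi = graph_random_features(W, f=lambda k: 0.5 ** k, n_walks=2000)
M_hat = Phi @ Phi.T  # low-rank estimate of a mask built from powers of W
```

Because walks halt geometrically, the expected number of nodes visited per walk depends on `p_halt` and `n_walks` rather than directly on the loop over all nodes, which is the intuition behind sparse features in this kind of construction.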

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 5

Strengths

1. The paper provides strong theoretical foundations with proven concentration bounds and complexity guarantees for GRFs.
2. The method shows concrete performance improvements on real-world tasks and scales to large problems (>30k nodes) that would be intractable with quadratic approaches.
3. The approach can be implemented with both symmetric and asymmetric GRFs, offering different trade-offs between computational efficiency and variance in mask estimation.
4. The experiments cover diverse appl

Weaknesses

The paper's central claim of $O(N)$ complexity relies critically on the assertion that Graph Random Features (GRFs) have $O(1)$ sparsity. This claim is mathematically incorrect for several reasons: In Lemma 3.2, while the result doesn't show an $N$ term, it is still implicitly dependent on the size of the graph. $O(1)$ complexity implies that your non-zero entries per row vector $\hat{\phi}_G(v_i)$ are bounded by a constant independent of input size. The bound in Lemma 3.2 is still dependent on multiple p

Reviewer 02 · Rating 8 · Confidence 2

Strengths

1. The proposed method has $O(n)$ time complexity and is suitable for relatively large-scale inputs.

Weaknesses

See question.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Their method is the first to achieve $\mathcal{O}(N)$-time complexity for computing masked attention for general graphs, $N$ being the number of vertices.
2. The paper provides the first known concentration bounds for GRFs and rigorous sparsity guarantees. These theoretical insights are valuable, potentially extending beyond transformers to other domains that rely on scalable graph-based representations.
3. Their method demonstrates improved predictive performance in various learning tasks.

Weaknesses

1. Dense and unclear presentation:
   - While the method is theoretically sound, the presentation is mathematically dense and lacks clear explanations. This may pose a barrier to readers, particularly those less familiar with GRFs. In particular, the technical exposition in lines 184–254 is notation-heavy and unclear.
   - Algorithmic descriptions, such as those in Algorithm 1, are highly abstract and may be difficult to follow. Without clearer explanations, the accessibility of the pap

Videos

Linear Transformer Topological Masking with Graph Random Features · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques

Methods: Softmax · Attention Is All You Need