Linear Transformer Topological Masking with Graph Random Features
Isaac Reid, Kumar Avinava Dubey, Deepali Jain, Will Whitney, Amr Ahmed, Joshua Ainslie, Alex Bewley, Mithun Jacob, Aranyak Mehta, David Rendleman, Connor Schenck, Richard E. Turner, René Wagner, Adrian Weller, Krzysztof Choromanski

TL;DR
This paper introduces a learnable, graph-structured topological masking method for linear transformers, leveraging graph random features to efficiently incorporate graph topology with strong theoretical guarantees, enabling scalable processing of large graph data.
Contribution
It proposes a novel parameterization of topological masks using graph random features, compatible with linear attention and providing theoretical concentration bounds.
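To make the idea of a graph random feature concrete, here is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the mask is a truncated power series in the weighted adjacency matrix `W`, and estimates `Phi = sum_l f[l] * W^l` with importance-weighted random walks that halt with probability `p_halt` at each step. The names `grf_features`, `power_series`, `f`, `n_walks`, and `p_halt` are hypothetical, and the feature matrix is stored densely only for readability.

```python
import numpy as np

def power_series(W, coeffs):
    """Exact sum_k coeffs[k] * W^k, used here only as a reference."""
    out = np.zeros(W.shape, dtype=float)
    P = np.eye(W.shape[0])
    for c in coeffs:
        out += c * P
        P = P @ W
    return out

def grf_features(W, f, n_walks=2000, p_halt=0.5, seed=0):
    """Monte Carlo estimate of Phi = sum_l f[l] * W^l (row i = feature of node i).

    Assumption: geometric halting plus importance weights, so that each
    row is an unbiased estimate of the corresponding row of Phi.
    """
    rng = np.random.default_rng(seed)
    N = W.shape[0]
    neighbours = [np.flatnonzero(W[i]) for i in range(N)]
    Phi = np.zeros((N, N))  # dense for clarity; each row has few non-zeros
    for i in range(N):
        for _ in range(n_walks):
            node, load, length = i, 1.0, 0
            Phi[i, node] += f[0]  # walk of length 0
            # Continue with probability 1 - p_halt, up to the last coefficient.
            while length + 1 < len(f) and rng.random() >= p_halt:
                nbrs = neighbours[node]
                if nbrs.size == 0:
                    break
                nxt = rng.choice(nbrs)
                # Edge weight divided by the probability of taking this step.
                load *= W[node, nxt] * nbrs.size / (1.0 - p_halt)
                node, length = nxt, length + 1
                Phi[i, node] += f[length] * load
    return Phi / n_walks

# Toy example: a 6-node cycle graph with edge weight 0.1.
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 0.1
f = np.array([1.0, 1.0, 1.0, 1.0])

Phi_hat = grf_features(W, f)
Phi_exact = power_series(W, f)
# For symmetric W, Phi @ Phi.T is a power series with coefficients f * f.
mask_exact = power_series(W, np.convolve(f, f))
```

In this sketch each walk touches only a handful of nodes, so a row of `Phi_hat` has at most as many non-zero entries as the walks from that node make visits. Products of feature rows drawn from independent walks then give an unbiased estimate of the corresponding mask entry.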
Findings
Achieves $\mathcal{O}(N)$ time and space complexity in the number of input tokens $N$.
Provides strong performance improvements on image and point cloud tasks.
Supports graphs with over 30,000 nodes.
Abstract
When training transformers on graph-structured data, incorporating information about the underlying topology is crucial for good performance. Topological masking, a type of relative position encoding, achieves this by upweighting or downweighting attention depending on the relationship between the query and keys in a graph. In this paper, we propose to parameterise topological masks as a learnable function of a weighted adjacency matrix -- a novel, flexible approach which incorporates a strong structural inductive bias. By approximating this mask with graph random features (for which we prove the first known concentration bounds), we show how this can be made fully compatible with linear attention, preserving $\mathcal{O}(N)$ time and space complexity with respect to the number of input tokens. The fastest previous alternative was $\mathcal{O}(N \log N)$ and only suitable for specific…
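The compatibility with linear attention can be illustrated with a small sketch, assuming non-causal attention and a mask that factorises as `Phi_q @ Phi_k.T`. Because the elementwise product of two low-rank matrices is itself low rank, the masked attention output can be computed without ever forming an N x N matrix. The function names and the small dense toy features below are assumptions for illustration; the abstract's complexity claim relies on sparse features instead.

```python
import numpy as np

def masked_linear_attention(Qf, Kf, V, Phi_q, Phi_k):
    """Row-normalised [(Qf Kf^T) * (Phi_q Phi_k^T)] V without N x N matrices."""
    N = Qf.shape[0]
    # Per-token outer product of attention features and mask features.
    Qt = np.einsum("nm,nr->nmr", Qf, Phi_q).reshape(N, -1)
    Kt = np.einsum("nm,nr->nmr", Kf, Phi_k).reshape(N, -1)
    KV = Kt.T @ V                 # (m*r, d), one pass over the keys
    num = Qt @ KV                 # (N, d)
    den = Qt @ Kt.sum(axis=0)     # (N,), row sums of the masked attention
    return num / den[:, None]

def masked_attention_dense(Qf, Kf, V, Phi_q, Phi_k):
    """Quadratic reference: builds the full masked attention matrix."""
    A = (Qf @ Kf.T) * (Phi_q @ Phi_k.T)
    return (A @ V) / A.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, m, r, d = 50, 4, 3, 5
# Positive toy features keep every normaliser strictly positive.
Qf, Kf = rng.random((N, m)) + 0.1, rng.random((N, m)) + 0.1
Phi_q, Phi_k = rng.random((N, r)) + 0.1, rng.random((N, r)) + 0.1
V = rng.standard_normal((N, d))

out_fast = masked_linear_attention(Qf, Kf, V, Phi_q, Phi_k)
out_dense = masked_attention_dense(Qf, Kf, V, Phi_q, Phi_k)
```

The two outputs agree because entry (i, j) of the masked attention matrix equals the dot product of row i of `Qt` with row j of `Kt`. The cost of the fast path grows linearly in N for fixed feature sizes.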
Peer Reviews
Decision: ICLR 2025 Poster
1. The paper provides strong theoretical foundations with proven concentration bounds and complexity guarantees for GRFs. 2. The method shows concrete performance improvements on real-world tasks and scales to large problems (>30k nodes) that would be intractable with quadratic approaches. 3. The approach can be implemented with both symmetric and asymmetric GRFs, offering different trade-offs between computational efficiency and variance in mask estimation. 4. The experiments cover diverse appl
The paper's central claim of O(N) complexity relies critically on the assertion that Graph Random Features (GRFs) have O(1) sparsity. This claim is mathematically incorrect for several reasons: In Lemma 3.2, while the result doesn’t show an N term, it is still implicitly dependent on the size of the graph. O(1) complexity implies that your non-zero entries per row vector $\hat{\phi}_G(v_i)$ are bounded by a constant independent of input size. The bound in Lemma 3.2 is still dependent on multiple p
1. The proposed method has $O(n)$ time complexity and is suitable for relatively large-scale inputs.
see question.
1. Their method is the first to achieve $\mathcal{O}(N)$-time complexity for computing masked attention for general graphs, $N$ being the number of vertices. 2. The paper provides the first known concentration bounds for GRFs and rigorous sparsity guarantees. These theoretical insights are valuable, potentially extending beyond transformers to other domains that rely on scalable graph-based representations. 3. Their method demonstrates improved predictive performance in various learning tasks.
1. Dense and unclear presentation: - While the method is theoretically sound, the presentation is mathematically dense and lacks clear explanations. This may pose a barrier to readers, particularly those less familiar with GRFs. In particular, the technical exposition in lines 184–254 is notation-heavy and unclear. - Algorithmic descriptions, such as those in Algorithm 1, are highly abstract and may be difficult to follow. Without clearer explanations, the accessibility of the pap
Taxonomy
Topics: Image Retrieval and Classification Techniques
Methods: Softmax · Attention Is All You Need
