Scaling Graph Transformers: A Comparative Study of Sparse and Dense Attention

Leon Dimitrov

arXiv:2508.17175·cs.LG·August 26, 2025

TL;DR

This paper compares dense and sparse attention mechanisms in graph transformers, analyzing their trade-offs, open challenges, and suitable use cases to clarify how effectively each captures long-range dependencies.

Contribution

The paper compares dense and sparse attention in graph transformers, setting out the advantages and limitations of each and offering guidance on when to apply them.

Findings

01 Sparse attention reduces computational cost compared to dense attention.

02 Dense attention captures long-range dependencies more effectively.

03 Trade-offs depend on graph size and task complexity.
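To make the cost difference in the first finding concrete, the following is a rough counting sketch, not a result from the paper. It assumes that sparse attention is restricted to the graph's edges (both directions of each undirected edge) plus one self-loop per node; the paper may cover other sparsity patterns. The function name `attention_pairs` and the example graph sizes are hypothetical.

```python
def attention_pairs(num_nodes: int, num_edges: int) -> tuple[int, int]:
    """Count the query-key pairs scored by each attention variant.

    Assumption (not from the paper): sparse attention scores only
    neighbouring nodes in an undirected graph, plus a self-loop per node.
    """
    # Dense attention: every node attends to every node.
    dense = num_nodes * num_nodes
    # Sparse attention: two directed pairs per undirected edge, plus self-loops.
    sparse = 2 * num_edges + num_nodes
    return dense, sparse


# Hypothetical graph: 10,000 nodes and 50,000 undirected edges.
dense, sparse = attention_pairs(10_000, 50_000)
print(f"dense pairs:  {dense:,}")
print(f"sparse pairs: {sparse:,}")
```

Under these assumptions the dense count grows quadratically with the number of nodes, while the sparse count grows with the number of edges, which is why the trade-off shifts as graphs get larger.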

Abstract

Graphs have become a central representation in machine learning for capturing relational and structured data across various domains. Traditional graph neural networks often struggle to capture long-range dependencies between nodes because they aggregate information only from local neighborhoods. Graph transformers overcome this by using attention mechanisms that allow nodes to exchange information globally. Attention in graph transformers comes in two forms: dense and sparse. In this paper, we compare these two attention mechanisms, analyze their trade-offs, and highlight when to use each. We also outline open challenges in designing attention for graph transformers.
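The distinction between the two attention types can be illustrated with a minimal NumPy sketch. It is a simplification under stated assumptions, not the paper's formulation: it uses a single head, omits learned query/key/value projections, and models sparse attention as a mask that keeps only graph edges and self-loops. The function `attention_weights` and the example path graph are hypothetical.

```python
import numpy as np


def attention_weights(x, adj=None):
    """Row-normalised attention weights over node features x of shape (n, d).

    adj=None  -> dense attention: every node attends to every node.
    adj given -> sparse attention: each node attends only to its
                 neighbours and itself (an assumed sparsity pattern).
    Learned projections are omitted to keep the sketch short.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)  # scaled dot-product scores, shape (n, n)
    if adj is not None:
        # Keep edges and self-loops; masked pairs get -inf, so weight 0.
        mask = adj.astype(bool) | np.eye(n, dtype=bool)
        scores = np.where(mask, scores, -np.inf)
    # Numerically stable softmax over each row.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


# Hypothetical example: a path graph 0 - 1 - 2 - 3 with random features.
x = np.random.default_rng(0).normal(size=(4, 8))
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 0],
                [0, 1, 0, 1],
                [0, 0, 1, 0]])

dense_w = attention_weights(x)
sparse_w = attention_weights(x, adj)
print("dense  weight node 0 -> node 3:", dense_w[0, 3])
print("sparse weight node 0 -> node 3:", sparse_w[0, 3])
```

In the sparse case node 0 places zero weight on node 3 within a single layer, so information from node 3 can reach it only through stacked layers, whereas dense attention connects the two directly. Note that this masked version still computes the full score matrix; it shows the connectivity pattern only, and an implementation aiming for the cost savings would compute scores along edges alone.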
