Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques

Nathaniel Tomczak; Sanmukh Kuppannagari

arXiv:2502.01659·cs.LG·February 10, 2025

TL;DR

This paper introduces a graph-based approach to implement sparse attention in transformers, enabling longer context processing with significant speedups and the ability to handle sequences up to 160 million tokens.

Contribution

The paper proposes a graph computing framework for sparse attention. Its algorithms are work-optimal, which significantly extends the sequence lengths that transformers can handle.
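To make the graph view of sparse attention concrete, here is a minimal sketch in plain Python. It treats tokens as nodes and each allowed (query, key) pair of the attention mask as a directed edge, then computes softmax attention only over those edges. This is an illustration under stated assumptions, not the authors' implementation (their code is in the repository listed under Code & Models): the function names `sliding_window_edges` and `graph_attention` are hypothetical, and the sliding-window mask is just one example of a sparse mask.

```python
import math


def sliding_window_edges(n, w):
    """Hypothetical helper: edge (i, j) means query token i attends to key token j.

    Assumes a sliding-window mask, i.e. |i - j| <= w.
    """
    return [(i, j) for i in range(n)
            for j in range(max(0, i - w), min(n, i + w + 1))]


def graph_attention(Q, K, V, edges):
    """Softmax attention evaluated only on the listed edges (illustrative sketch)."""
    n, d = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d)

    # Build an adjacency list: the keys each query node is allowed to see.
    neighbors = [[] for _ in range(n)]
    for i, j in edges:
        neighbors[i].append(j)

    out = []
    for i in range(n):
        row = [0.0] * len(V[0])
        if neighbors[i]:
            # One dot product per edge, so the work scales with the edge count.
            scores = [scale * sum(q * k for q, k in zip(Q[i], K[j]))
                      for j in neighbors[i]]
            m = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - m) for s in scores]
            total = sum(weights)
            for wgt, j in zip(weights, neighbors[i]):
                for c, v in enumerate(V[j]):
                    row[c] += (wgt / total) * v
        out.append(row)
    return out


# Usage: 4 tokens, window radius 1, zero queries give uniform weights per row.
edges = sliding_window_edges(4, 1)
Q = [[0.0, 0.0] for _ in range(4)]
K = [[1.0, 2.0] for _ in range(4)]
V = [[0.0], [1.0], [2.0], [3.0]]
result = graph_attention(Q, K, V, edges)
```

In this sketch the number of dot products equals `len(edges)`, which for a fixed window radius `w` grows as roughly `n * (2w + 1)` rather than `n * n`. That is the sense in which an edge-driven formulation does work proportional to the nonzeros of the mask.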

Findings

01 Achieves substantial speedups over existing attention methods.

02 Enables processing of sequences up to 160 million tokens.

03 Demonstrates work-optimality of the proposed algorithms.

Abstract

Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate pairwise interactions between individual tokens of sequential data. However, the primary limitation of this operation is its quadratic memory and time complexity in relation to the input's context length - the length of a sequence over which the interactions need to be captured. This significantly limits the length of sequences that can be inferred upon by these models. Extensive research has been conducted to reduce the number of pairwise interactions to sub-quadratic in relation to the context length by introducing sparsity into the attention mechanism through the development of sparse attention masks. However, efficient implementations that achieve "true…

Code & Models

Repositories

KLab-AI3/Graph-Processing-Attention-IPDPS-2025
PyTorch · Official

Taxonomy

Topics: Advanced Graph Neural Networks

Methods: Softmax · Attention Is All You Need