Longer Attention Span: Increasing Transformer Context Length with Sparse Graph Processing Techniques
Nathaniel Tomczak, Sanmukh Kuppannagari

TL;DR
This paper introduces a graph-based approach to implement sparse attention in transformers, enabling longer context processing with significant speedups and the ability to handle sequences up to 160 million tokens.
Contribution
It proposes a graph computing framework for sparse attention, achieving work-optimal algorithms that significantly extend sequence length capabilities in transformers.
Findings
Achieves substantial speedups over existing attention methods.
Enables processing of sequences up to 160 million tokens.
Demonstrates work-optimality of the proposed algorithms.
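To make the work-optimality claim concrete, the following is a minimal pure-Python sketch of attention computed over a graph's edge list, so that only the query-key pairs allowed by the mask are ever touched. It is not the authors' implementation: the function names `sliding_window_edges` and `sparse_attention`, the choice of a local-window mask, and the two-pass softmax are all assumptions made for illustration. The point it shows is that the work scales with the number of edges rather than with the square of the sequence length.

```python
import math


def sliding_window_edges(n, w):
    """Hypothetical mask: token i attends to every token j with |i - j| <= w."""
    return [(i, j) for i in range(n) for j in range(max(0, i - w), min(n, i + w + 1))]


def sparse_attention(q, k, v, edges):
    """Softmax attention restricted to the (query, key) pairs in `edges`.

    q, k, v are lists of float vectors. The work is O(len(edges) * d),
    compared with O(n^2 * d) for dense attention.
    """
    n, d = len(q), len(v[0])
    scale = 1.0 / math.sqrt(len(q[0]))

    # Pass 1: one scaled dot product per edge; track each row's maximum
    # so the exponentials below are numerically stable.
    scores = []
    row_max = [-math.inf] * n
    for i, j in edges:
        s = scale * sum(a * b for a, b in zip(q[i], k[j]))
        scores.append(s)
        row_max[i] = max(row_max[i], s)

    # Pass 2: accumulate softmax numerators and denominators per query row.
    denom = [0.0] * n
    out = [[0.0] * d for _ in range(n)]
    for (i, j), s in zip(edges, scores):
        weight = math.exp(s - row_max[i])
        denom[i] += weight
        for c in range(d):
            out[i][c] += weight * v[j][c]

    # Normalise; rows with no incoming edges are left as zeros.
    for i in range(n):
        if denom[i] > 0.0:
            out[i] = [x / denom[i] for x in out[i]]
    return out
```

For example, `sliding_window_edges(6, 1)` yields 16 edges, whereas a dense mask over 6 tokens would cover 36 pairs; the gap widens linearly versus quadratically as the sequence grows.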
Abstract
Transformers have demonstrated great success in numerous domains including natural language processing and bioinformatics. This success stems from the use of the attention mechanism by these models in order to represent and propagate pairwise interactions between individual tokens of sequential data. However, the primary limitation of this operation is its quadratic memory and time complexity in relation to the input's context length - the length of a sequence over which the interactions need to be captured. This significantly limits the length of sequences that can be inferred upon by these models. Extensive research has been conducted to reduce the number of pairwise interactions to sub-quadratic in relation to the context length by introducing sparsity into the attention mechanism through the development of sparse attention masks. However, efficient implementations that achieve "true…
Taxonomy
Topics: Advanced Graph Neural Networks
Methods: Softmax · Attention Is All You Need
