Superlinear Multi-Step Attention
Yufeng Huang

TL;DR
This paper introduces Superlinear attention, a multi-step attention architecture that brings the cost of attention over long sequences below quadratic while keeping every token position selectable, so that very long contexts can be processed efficiently.
Contribution
It presents a fully trainable multi-step attention mechanism with subquadratic complexity, combining span search and attention, and demonstrates its feasibility and initial effectiveness on long-context tasks.
Findings
Achieves $O(L^{1+1/N})$ complexity with multi-step search
Demonstrates strong performance on long-context tasks up to 256K tokens
Attains high decoding throughput on large models at long sequence lengths
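One rough way to read the $O(L^{1+1/N})$ figure above, offered as a back-of-the-envelope sketch rather than the paper's own derivation: assume each of the $N$ steps narrows the candidate set by a factor of about $L^{1/N}$, so that a single query inspects on the order of $L^{1/N}$ items per step. The symbol $C(L, N)$ and the choice $256\text{K} = 2^{18}$ are assumptions made here for illustration.

$$
C(L, N) \;\approx\; \underbrace{L}_{\text{queries}} \cdot \underbrace{N \cdot L^{1/N}}_{\text{work per query}} \;=\; O\!\left(L^{1+1/N}\right) \quad \text{for fixed } N .
$$

$$
N = 2,\; L = 2^{18}: \qquad L^{2} = 2^{36} \approx 6.9 \times 10^{10}, \qquad L^{1.5} = 2^{27} \approx 1.3 \times 10^{8}, \qquad \frac{L^{2}}{L^{1.5}} = \sqrt{L} = 512 .
$$

Under these assumptions, a two-step scheme at this length does roughly 512 times fewer query-key comparisons than full quadratic attention, ignoring constant factors.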
Abstract
In this paper, we propose Superlinear attention, a fully trainable multi-step attention architecture that achieves subquadratic complexity for long sequences while preserving random context access (a.k.a. structural non-exclusion): no eligible token position is structurally excluded from being selected for attention. Superlinear attention reformulates standard causal self-attention as a multi-step search problem with $N$ steps, yielding an overall complexity of $O(L^{1+1/N})$. To illustrate the architecture, we present a baseline $N = 2$ implementation, which is algorithmically analogous to standard jump search. In this instantiation, the first step performs span-search to select relevant spans of the sequence, and the second step applies span-attention (standard attention restricted to the selected spans). In an upscaled…
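The two-step baseline described in the abstract (span-search, then span-attention) can be pictured with a minimal pure-Python sketch for a single query. This is not the paper's implementation: the function name `two_step_attention`, the parameters `span_len` and `top_k`, and above all the span scorer (the dot product of the query with the first key of each span) are hypothetical stand-ins for whatever trainable search the paper actually uses.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def two_step_attention(
    query: Vector,
    keys: Sequence[Vector],
    values: Sequence[Vector],
    t: int,
    span_len: int,
    top_k: int,
) -> Tuple[List[float], List[int]]:
    """Illustrative two-step causal attention for the query at position t.

    Step 1 (span-search): score each span of the causal prefix [0, t] and
    keep the top_k spans. ASSUMPTION: a span is scored by the query's dot
    product with the span's first key; the paper's scorer is learned.
    Step 2 (span-attention): softmax attention restricted to the tokens
    of the selected spans.
    """
    n_spans = t // span_len + 1  # spans that overlap positions 0..t

    # Step 1: one score per span, i.e. about t / span_len comparisons.
    scored = [(dot(query, keys[s * span_len]), s) for s in range(n_spans)]
    chosen = sorted(s for _, s in sorted(scored, reverse=True)[:top_k])

    # Collect the causal positions covered by the chosen spans.
    positions = [
        i
        for s in chosen
        for i in range(s * span_len, min((s + 1) * span_len, t + 1))
    ]

    # Step 2: scaled dot-product attention over at most top_k * span_len tokens.
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    out_dim = len(values[0])
    output = [
        sum(w * values[i][d] for w, i in zip(weights, positions)) / total
        for d in range(out_dim)
    ]
    return output, positions
```

With `span_len` near $\sqrt{L}$ and a small constant `top_k`, both steps touch on the order of $\sqrt{L}$ keys per query, which is the jump-search analogy. Every span of the causal prefix is scored in step 1, so no position is ruled out in advance; which spans are kept depends only on the scores.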
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Stochastic Gradient Optimization Techniques
