SEA: Sparse Linear Attention with Estimated Attention Mask
Heejun Lee, Jina Kim, Jeffrey Willette, Sung Ju Hwang

TL;DR
SEA introduces a novel sparse linear attention method that estimates and sparsifies attention matrices, achieving better performance and interpretability than previous approaches while reducing memory usage, enabling large transformer models on resource-limited devices.
Contribution
The paper presents SEA, a new method that estimates attention matrices with linear complexity and creates sparse, interpretable attention matrices, improving efficiency and performance over prior methods.
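To make the "sparse, interpretable attention matrix" idea concrete, here is a minimal sketch of how an estimated attention row could be turned into a sparse mask. The per-row top-k rule and the function names (`topk_mask`, `sparse_attention_row`) are assumptions made for illustration; the summary above does not specify the selection rule, and this is not the authors' implementation.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def topk_mask(est_row, k):
    """Indices of the k largest entries in one row of an estimated attention matrix.

    Assumption: a plain per-row top-k rule, chosen only for illustration.
    """
    order = sorted(range(len(est_row)), key=lambda j: est_row[j], reverse=True)
    return sorted(order[:k])

def sparse_attention_row(scores_row, keep, values):
    """Softmax restricted to the kept positions, then a weighted sum of their values."""
    weights = softmax([scores_row[j] for j in keep])
    dim = len(values[0])
    out = [0.0] * dim
    for w, j in zip(weights, keep):
        for d in range(dim):
            out[d] += w * values[j][d]
    return weights, out

if __name__ == "__main__":
    estimated = [0.10, 0.50, 0.05, 0.35]   # hypothetical estimated attention row
    scores = [0.0, 2.0, -1.0, 2.0]         # hypothetical query-key scores
    values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
    keep = topk_mask(estimated, k=2)
    weights, out = sparse_attention_row(scores, keep, values)
    print(keep, weights, out)
```

Only the `k` kept positions per row are scored and stored, so the per-row cost depends on `k` rather than on the sequence length, and the kept indices can be inspected directly.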
Findings
SEA outperforms previous linear and sparse attention methods in perplexity scores.
SEA uses roughly half the memory of the quadratic OPT-1.3B baseline.
The approach enables large transformers to run efficiently on resource-limited devices.
Abstract
The transformer architecture has driven breakthroughs in recent years on tasks which require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, long sequences pose a problem due to the quadratic complexity of the attention operation. Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet, these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix and often require complete retraining from scratch. Furthermore, previous sparse and linear approaches lose interpretability if they cannot produce full attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then…
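The linear-complexity claim for kernel-based attention can be illustrated with a small sketch. It assumes a simple positive feature map (elu(x) + 1) as a stand-in; the paper's own estimator may use a different kernel, and the function names here are hypothetical. The point shown is only that reordering the matrix products avoids forming the n × n attention matrix while giving the same result as the explicit quadratic computation with the same kernel.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1 (an illustrative stand-in kernel)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal kernel attention in O(n * d * dv) time, no n x n matrix."""
    d, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(d)]   # running sum of phi(k_j) v_j^T
    z = [0.0] * d                        # running sum of phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel attention, but materializing every pairwise weight (O(n^2))."""
    out = []
    for q in Q:
        fq = phi(q)
        w = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][b] for j in range(len(V))) / total
                    for b in range(len(V[0]))])
    return out

if __name__ == "__main__":
    Q = [[0.2, -0.1], [1.0, 0.5], [-0.3, 0.8]]
    K = [[0.4, 0.1], [-0.6, 0.9], [0.7, -0.2]]
    V = [[1.0, 0.0], [0.0, 1.0], [2.0, 3.0]]
    print(linear_attention(Q, K, V))
    print(quadratic_reference(Q, K, V))
```

The two functions agree up to floating-point error because the sums over keys are simply regrouped; the linear version keeps only a d × dv summary and a length-d normalizer, which is why its memory does not grow with the square of the sequence length.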
Peer Reviews
Decision: ICLR 2024 poster
Enabling faster processing of long sequences is an important research direction, and the proposed method is well-motivated. I appreciate the effort made in presenting the method, which, despite its complexity, can still be understood. The idea of combining kernel-based linear attention and sparsification is novel. On GLUE tasks, experiments show how SEA approximates full attention better than other methods while remaining competitive in terms of memory footprint. Moreover, unlike other approache
- Comparison with FlashAttention [1]: It would be fair to add FlashAttention among the baselines. In particular, FlashAttention would also be competitive in terms of memory.
- The method is still quite complex, making it hard to deploy.
- The latency results do not show a clear advantage of the method over baselines, often being significantly slower.
- The justification of the method for autoregressive language modeling is unclear. As most causal models are used for sequence generation, sampling o
The computational complexity of the attention mechanism is a serious bottleneck and improving this to linear is very useful. The strength of the paper is that it tackles an important problem.
The paper's clarity and explanation of the algorithm's functionality are lacking, making it challenging to determine its applicability, particularly with regard to pre-trained models.
S1. I like the approach of a "test-time" sparse linear attention.
S2. I appreciate the thorough description of the authors' contributions.
S3. I appreciate the contribution of a new Triton kernel for sparse operations.
W1. One of the motivations of the paper is that other linear attentions cannot distill the learned attention patterns, and hence need to train from scratch. However, the authors in the paper still need to train their Performer and Decoder from scratch. I haven't seen any discussion about the inherent cost of doing that. Intuitively, it should be cheaper than training from scratch, but can you point me to the text (or elaborate in a new discussion) about how expensive it is to do this training?
Taxonomy
Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Advanced Neural Network Applications
Methods: Knowledge Distillation
