SEA: Sparse Linear Attention with Estimated Attention Mask

Heejun Lee; Jina Kim; Jeffrey Willette; Sung Ju Hwang

arXiv:2310.01777 · cs.CL · March 26, 2024

TL;DR

SEA is a sparse linear attention method that estimates the attention matrix with linear complexity and then sparsifies it. It achieves better perplexity than previous linear and sparse attention approaches, keeps the attention matrix interpretable, and reduces memory usage, which helps run large transformer models on resource-limited devices.

Contribution

The paper presents SEA, a new method that estimates attention matrices with linear complexity and creates sparse, interpretable attention matrices, improving efficiency and performance over prior methods.
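
As a rough illustration of the "estimate, then sparsify" idea, the sketch below keeps the largest entries in each row of an estimated attention matrix and evaluates softmax attention only at the kept positions. Per-row top-k selection, the function names `topk_mask` and `sparse_attention`, the toy values, and the use of plain Python lists are all assumptions made for illustration; this summary does not specify how SEA builds its mask.

```python
import math

def topk_mask(estimated, top_k):
    """Boolean mask that keeps the top_k largest entries in each row."""
    mask = []
    for row in estimated:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:top_k])
        mask.append([j in keep for j in range(len(row))])
    return mask

def sparse_attention(queries, keys, values, mask):
    """Softmax attention evaluated only at the positions the mask keeps."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q_row, m_row in zip(queries, mask):
        kept = [j for j, m in enumerate(m_row) if m]
        # Scores are computed for kept keys only, never for the full row.
        scores = [scale * sum(a * b for a, b in zip(q_row, keys[j])) for j in kept]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w / total * values[j][c] for w, j in zip(weights, kept))
            for c in range(len(values[0]))
        ])
    return out

# Hypothetical estimated attention for 2 queries over 3 keys.
estimated = [[0.1, 0.7, 0.2],
             [0.5, 0.3, 0.2]]
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

mask = topk_mask(estimated, top_k=2)
output = sparse_attention(queries, keys, values, mask)
```

Because each query attends to at most `top_k` keys, the per-row cost depends on `top_k` rather than on the full sequence length, and the mask itself records which positions were attended.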

Findings

01. SEA achieves better (lower) perplexity than previous linear and sparse attention methods.

02. SEA uses roughly half the memory of the OPT-1.3B baseline.

03. The approach enables large transformers to run efficiently on resource-limited devices.
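
To make the memory argument concrete, the following back-of-the-envelope sketch compares the size of a full attention matrix with a row-sparse one as the sequence grows. The head count, 16-bit values, and 64 kept entries per row are hypothetical settings, and the calculation ignores index storage and all other activations; it illustrates scaling only and does not reproduce the paper's measurement.

```python
def dense_attention_bytes(seq_len, num_heads, bytes_per_value):
    """Bytes for one layer's full seq_len x seq_len attention matrices."""
    return num_heads * seq_len * seq_len * bytes_per_value

def sparse_attention_bytes(seq_len, num_heads, kept_per_row, bytes_per_value):
    """Bytes when each query row stores only kept_per_row values (indices ignored)."""
    return num_heads * seq_len * kept_per_row * bytes_per_value

# Hypothetical settings: 32 heads, 2-byte values, 64 kept entries per row.
for seq_len in (2048, 4096, 8192):
    dense = dense_attention_bytes(seq_len, 32, 2)
    sparse = sparse_attention_bytes(seq_len, 32, 64, 2)
    print(seq_len, dense // 2**20, "MiB dense,", sparse // 2**20, "MiB sparse")
```

Doubling the sequence length quadruples the dense figure but only doubles the sparse one, which is why the gap widens on long sequences.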

Abstract

The transformer architecture has driven breakthroughs in recent years on tasks that require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, long sequences pose a problem due to the quadratic complexity of the attention operation. Previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet, these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix and often require complete retraining from scratch. Furthermore, previous sparse and linear approaches lose interpretability if they cannot produce full attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then…
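
Kernel-based linear attention rests on a standard property: once queries and keys pass through a positive feature map, the matrix product can be reordered so that the T x T attention matrix is never formed. The sketch below checks that property on toy data. The `elu(x) + 1` feature map, the non-causal setting, the toy inputs, and all function names are assumptions chosen for brevity; the reviews mention that the paper itself trains a Performer.

```python
import math

def feature_map(row):
    """elu(x) + 1, a simple positive feature map (assumed here for illustration)."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in row]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def quadratic_kernel_attention(q, k, v):
    """Forms the full T x T score matrix: O(T^2) time and memory in T."""
    phi_q = [feature_map(r) for r in q]
    phi_k = [feature_map(r) for r in k]
    scores = matmul(phi_q, transpose(phi_k))          # T x T
    out = matmul(scores, v)                           # T x d_v
    return [[o / sum(s_row) for o in o_row] for o_row, s_row in zip(out, scores)]

def linear_kernel_attention(q, k, v):
    """Same output via phi(Q) (phi(K)^T V): linear in T, no T x T matrix."""
    phi_q = [feature_map(r) for r in q]
    phi_k = [feature_map(r) for r in k]
    kv = matmul(transpose(phi_k), v)                  # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*phi_k)]         # d, used for row normalization
    out = matmul(phi_q, kv)                           # T x d_v
    norm = [sum(a * b for a, b in zip(row, k_sum)) for row in phi_q]
    return [[o / n for o in o_row] for o_row, n in zip(out, norm)]

# Toy inputs: T = 3 tokens, d = 2 features.
q = [[0.1, -0.2], [0.5, 0.3], [-1.0, 0.8]]
k = [[0.4, 0.9], [-0.3, 0.2], [0.7, -0.6]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

full = quadratic_kernel_attention(q, k, v)
fast = linear_kernel_attention(q, k, v)
max_diff = max(abs(x - y) for ra, rb in zip(full, fast) for x, y in zip(ra, rb))
```

The two functions agree up to floating-point rounding. Only the second avoids the quadratic intermediate, which is what makes a linear-cost estimate of attention possible.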

Peer Reviews

Decision: ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

Enabling faster processing of long sequences is an important research direction, and the proposed method is well-motivated. I appreciate the effort made in presenting the method, which, despite its complexity, can still be understood. The idea of combining kernel-based linear attention and sparsification is novel. On GLUE tasks, experiments show how SEA approximates full attention better than other methods while remaining competitive in terms of memory footprint. Moreover, unlike other approache

Weaknesses

- Comparison with FlashAttention [1]: It would be fair to add FlashAttention among the baselines. Especially, FlashAttention would also be competitive in terms of memory.
- The method is still quite complex, making it hard to deploy.
- The latency results do not show a clear advantage of the method over baselines, often being significantly slower.
- The justification of the method for autoregressive language modeling is unclear. As most causal models are used for sequence generation, sampling o

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

The computational complexity of the attention mechanism is a serious bottleneck and improving this to linear is very useful. The strength of the paper is that it tackles an important problem.

Weaknesses

The paper's clarity and explanation of the algorithm's functionality are lacking, making it challenging to determine its applicability, particularly with regard to pre-trained models.

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 3

Strengths

S1. I like the approach of a "test-time" sparse linear attention.
S2. I appreciate the thorough description of the authors' contributions.
S3. I appreciate the contribution of a new Triton kernel for sparse operations.

Weaknesses

W1. One of the motivations of the paper is that other linear attentions cannot distill the learned attention patterns, and hence need to train from scratch. However, the authors in the paper still need to train their Performer and Decoder from scratch. I haven't seen any discussion about the inherent cost of doing that. Intuitively, it should be cheaper than training from scratch, but can you point me to the text (or elaborate in a new discussion) about how expensive it is to do this training?

Code & Models

Repositories

gmlwns2000/sea-attention
jax · Official

Taxonomy

Topics: Ferroelectric and Negative Capacitance Devices · Topic Modeling · Advanced Neural Network Applications

Methods: Knowledge Distillation