Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers

Chao Lou; Zixia Jia; Zilong Zheng; Kewei Tu

arXiv:2406.16747 · cs.CL · June 25, 2024 · 3 cites

TL;DR

This paper introduces SPARSEK Attention, a sparse attention mechanism for autoregressive Transformers that selects a constant number of KV pairs per query, giving linear time complexity and a constant memory footprint during generation. It can be added to pre-trained LLMs with minimal fine-tuning.

Contribution

The paper presents SPARSEK Attention, a novel sparse attention method that combines a scoring network with a differentiable top-k mask operator to select a constant number of KV pairs for each query, reducing the time complexity and KV memory usage of long-range Transformers.
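The selection idea described above can be illustrated with a minimal sketch. It assumes each position already has a scalar importance score (a stand-in for the output of a scoring network) and uses a hard, non-differentiable top-k for clarity. The function name `topk_sparse_attention` and its arguments are hypothetical and are not taken from the paper. The sort inside the loop is not linear-time, so this shows only which KV pairs a query attends to, not the efficiency of the actual method.

```python
import math


def topk_sparse_attention(queries, keys, values, scores, k):
    """Causal attention where each query attends to at most k earlier positions.

    queries, keys, values: lists of float vectors, one per position.
    scores: one importance score per position (assumed to come from
            some scoring network; here it is just a list of floats).
    """
    outputs = []
    for t, q in enumerate(queries):
        # Causal candidates are positions 0..t; keep the k highest-scored ones.
        selected = sorted(range(t + 1), key=lambda i: scores[i], reverse=True)[:k]
        # Scaled dot-product logits over the selected keys only.
        scale = math.sqrt(len(q))
        logits = [sum(a * b for a, b in zip(q, keys[i])) / scale for i in selected]
        # Numerically stable softmax weights.
        peak = max(logits)
        weights = [math.exp(x - peak) for x in logits]
        total = sum(weights)
        # Weighted average of the selected values.
        out = [
            sum(w * values[i][d] for w, i in zip(weights, selected)) / total
            for d in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

With `k=1`, every query attends only to the single highest-scored position seen so far, so the output at each step is exactly that position's value vector.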

Findings

01. Outperforms previous sparse attention methods in speed and efficiency.
02. Achieves linear time complexity and constant memory during generation.
03. Easily integrates into pre-trained LLMs with minimal fine-tuning.
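To see why keeping a constant number of KV pairs yields constant memory during generation, consider the following sketch of a bounded cache. It assumes each position receives one fixed scalar score when it is generated and that the lowest-scored entry is evicted once the cache is full. The class `TopKCache` and this eviction policy are illustrative assumptions, not the paper's implementation.

```python
import heapq


class TopKCache:
    """Holds at most k (score, position, key, value) entries."""

    def __init__(self, k):
        self.k = k
        self._heap = []  # min-heap ordered by (score, position)

    def add(self, score, position, key, value):
        entry = (score, position, key, value)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        else:
            # Insert the new entry, then drop whichever entry is smallest
            # (possibly the new one), so the size never exceeds k.
            heapq.heappushpop(self._heap, entry)

    def positions(self):
        """Positions currently retained, in sequence order."""
        return sorted(p for _, p, _, _ in self._heap)

    def __len__(self):
        return len(self._heap)
```

Under these assumptions each generation step costs O(log k) to update the cache and O(k) to attend over it, so for a fixed k the total cost grows linearly with sequence length while the cache size stays at k.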

Abstract

Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during…
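One way a top-k mask can be made differentiable is to relax it: instead of a hard 0/1 mask, project the scores onto the set of vectors m with 0 <= m_i <= 1 and sum(m) = k. The projection has the form m_i = clip(s_i - tau, 0, 1) for a threshold tau chosen so the entries sum to k, which makes the mask piecewise linear in the scores. The sketch below finds tau by bisection. It is an illustration of this general relaxation under stated assumptions; the function name `relaxed_topk_mask` is hypothetical, and the paper's SPARSEK operator may be defined or computed differently.

```python
def relaxed_topk_mask(scores, k, iters=60):
    """Soft mask m with 0 <= m_i <= 1 and sum(m) == k, up to bisection tolerance."""
    if not 0 < k <= len(scores):
        raise ValueError("k must satisfy 0 < k <= len(scores)")

    def mask_for(tau):
        return [min(1.0, max(0.0, s - tau)) for s in scores]

    lo = min(scores) - 1.0  # every entry clips to 1, so the sum is len(scores) >= k
    hi = max(scores)        # every entry clips to 0, so the sum is 0 < k
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        # The sum decreases as tau grows, so move the bound on the side
        # where the sum is still too large or already small enough.
        if sum(mask_for(mid)) > k:
            lo = mid
        else:
            hi = mid
    return mask_for((lo + hi) / 2.0)
```

For scores `[2.0, 1.0, 0.5, -1.0]` and `k=2`, the threshold works out to 0.25, so the top score is fully selected and the next two receive fractional weights that sum with it to 2.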

Taxonomy

Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · EEG and Brain-Computer Interfaces

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings