TL;DR
LISA is a linear-time self-attention mechanism for sequence modeling that keeps the full attention context while reducing compute and memory cost, making self-attention practical for recommendation over long sequences.
Contribution
The paper proposes LISA, a novel linear-time self-attention method that combines the effectiveness of vanilla attention with the efficiency of sparse attention, without restrictions on sequence length.
Findings
LISA outperforms state-of-the-art efficient attention methods in accuracy.
LISA is up to 57x faster than vanilla self-attention.
LISA is up to 78x more memory-efficient than vanilla self-attention.
Abstract
Self-attention has become increasingly popular in a variety of sequence modeling tasks from natural language processing to recommendation, due to its effectiveness. However, self-attention suffers from quadratic computational and memory complexities, prohibiting its applications on long sequences. Existing approaches that address this issue mainly rely on a sparse attention context, either using a local window, or a permuted bucket obtained by locality-sensitive hashing (LSH) or sorting, while crucial information may be lost. Inspired by the idea of vector quantization that uses cluster centroids to approximate items, we propose LISA (LInear-time Self Attention), which enjoys both the effectiveness of vanilla self-attention and the efficiency of sparse attention. LISA scales linearly with the sequence length, while enabling full contextual attention via computing differentiable…
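The vector-quantization idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation. It assumes hard nearest-centroid assignment (so it is not differentiable), a random codebook, a single non-causal query, and that keys and values are both replaced by their codewords. The names `quantize`, `codeword_attention`, and `codebook` are hypothetical. Under those assumptions, attending over N quantized items gives the same result as attending over B codewords weighted by how often each occurs, so the per-query cost depends on B rather than N.

```python
import numpy as np

def full_attention(q, keys, values):
    """Vanilla softmax attention for one query: cost grows with N."""
    scores = keys @ q
    w = np.exp(scores - scores.max())  # subtract max for numerical stability
    return (w / w.sum()) @ values

def quantize(x, codebook):
    """Hard-assign each row of x to the index of its nearest codeword."""
    dist = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dist.argmin(axis=1)

def codeword_attention(q, codebook, counts):
    """Attention over B codewords, each weighted by its occurrence count."""
    scores = codebook @ q
    w = counts * np.exp(scores - scores.max())
    return (w / w.sum()) @ codebook

rng = np.random.default_rng(0)
N, B, d = 512, 8, 16                 # sequence length, codebook size, dimension
codebook = rng.normal(size=(B, d))   # assumption: random, not learned
items = rng.normal(size=(N, d))
q = rng.normal(size=d)

codes = quantize(items, codebook)                       # one code per item
counts = np.bincount(codes, minlength=B).astype(float)  # occurrences per codeword

quantized = codebook[codes]          # items replaced by their centroids
reference = full_attention(q, quantized, quantized)     # touches all N items
approx = codeword_attention(q, codebook, counts)        # touches only B codewords

print(np.allclose(reference, approx))
```

The two results agree because items that share a codeword contribute identical terms to the softmax, so they can be grouped and counted once. The quality of the approximation to attention over the original, unquantized items depends on how well the codebook fits the data, which this sketch does not measure.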
