Memory-efficient Transformers via Top-$k$ Attention

Ankit Gupta; Guy Dar; Shaya Goodman; David Ciprut; Jonathan Berant

arXiv:2106.06899 · cs.CL · June 15, 2021

TL;DR

This paper introduces a top-$k$ approximation of vanilla attention for Transformers. Its memory usage is linear in the input size, it is a drop-in replacement in pre-trained models that requires no corrective pre-training, and it maintains near-vanilla accuracy in multiple setups.

Contribution

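The authors propose a simple top-$k$ attention method: queries are processed in chunks, and for each query only the top-$k$ scores with respect to the keys are kept. The method is memory-efficient and can be used with existing pre-trained models without corrective pre-training.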

Findings

01. Memory usage is linear in the input size.

02. Accuracy is nearly identical to vanilla attention in multiple setups.

03. Top-$k$ attention can also yield significant memory savings in feed-forward layers.

Abstract

Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants such as Performer and RFA, (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to…
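The chunk-and-select procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function name `topk_attention`, the `chunk_size` parameter, the $1/\sqrt{d}$ score scaling, and the choice to apply softmax only over the kept scores are all assumptions made for the example.

```python
import heapq
import math


def topk_attention(queries, keys, values, k, chunk_size):
    """Illustrative chunked top-k attention over plain Python lists.

    queries: list of n_q vectors of length d
    keys:    list of n_k vectors of length d
    values:  list of n_k vectors of length d_v
    Returns a list of n_q output vectors of length d_v.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)  # assumed: standard dot-product scaling
    k = min(k, len(keys))
    outputs = []
    # Process queries chunk by chunk, so the score matrix held at any time
    # has len(chunk) * n_k entries rather than n_q * n_k.
    for start in range(0, len(queries), chunk_size):
        chunk = queries[start:start + chunk_size]
        chunk_scores = [
            [scale * sum(qi * ki for qi, ki in zip(q, key)) for key in keys]
            for q in chunk
        ]
        for scores in chunk_scores:
            # Indices of the k largest scores for this query.
            top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
            # Softmax over the kept scores only (assumed), shifted by the
            # maximum for numerical stability.
            m = max(scores[j] for j in top)
            weights = [math.exp(scores[j] - m) for j in top]
            total = sum(weights)
            out = [0.0] * len(values[0])
            for w, j in zip(weights, top):
                for c, v in enumerate(values[j]):
                    out[c] += (w / total) * v
            outputs.append(out)
    return outputs
```

In this sketch, setting `k` to the number of keys recovers ordinary softmax attention, and changing `chunk_size` affects only how many scores are held at once, not the result.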

Taxonomy

Methods: Linear Layer · Fast Attention Via Positive Orthogonal Random Features · Performer · Dropout · Layer Normalization · Byte Pair Encoding · Attention Is All You Need · Gated Linear Unit · Inverse Square Root Schedule