Predicting Attention Sparsity in Transformers

Marcos Treviso; António Góis; Patrick Fernandes; Erick Fonseca; André F. T. Martins

arXiv:2109.12188·cs.CL·April 22, 2022

TL;DR

Sparsefinder predicts the sparsity pattern of entmax attention in transformers before the attention is computed, aiming to avoid the quadratic cost of exact entmax while maintaining accuracy. The paper also provides a detailed analysis of the tradeoff between sparsity and recall.

Contribution

We introduce Sparsefinder, a novel model that predicts attention sparsity patterns in transformers, facilitating efficient sparse attention computation.

Findings

01

Sparsefinder effectively predicts attention sparsity patterns.

02

Our methods achieve favorable tradeoffs between sparsity and recall.

03

Extensive analysis guides future benchmarks for sparse attention models.
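The sparsity-recall tradeoff mentioned in the findings can be illustrated with a small sketch. It assumes the usual reading of the two quantities: sparsity is the fraction of query-key pairs left out of the predicted attention graph, and recall is the fraction of ground-truth (nonzero) attention edges that the predicted graph keeps. The function name `sparsity_and_recall` and the toy matrices are hypothetical and are not taken from the paper.

```python
def sparsity_and_recall(gold, pred):
    """Compare a predicted attention graph against the ground-truth one.

    gold, pred: lists of rows of 0/1 flags, shape (n_queries, n_keys).
    A 1 means the query attends to (or is predicted to attend to) the key.
    These definitions are an assumption made for illustration.
    """
    total_pairs = sum(len(row) for row in gold)
    pred_edges = sum(sum(row) for row in pred)
    gold_edges = sum(sum(row) for row in gold)
    # Edges present in both the ground-truth and the predicted graph.
    hits = sum(
        g and p
        for gold_row, pred_row in zip(gold, pred)
        for g, p in zip(gold_row, pred_row)
    )
    sparsity = 1.0 - pred_edges / total_pairs
    recall = hits / gold_edges if gold_edges else 1.0
    return sparsity, recall


# Toy example: 3 queries, 4 keys.
gold = [[1, 0, 0, 1],
        [0, 1, 0, 0],
        [0, 0, 1, 1]]
pred = [[1, 1, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 1]]

sparsity, recall = sparsity_and_recall(gold, pred)
print(f"sparsity={sparsity:.3f} recall={recall:.3f}")
```

A predictor that keeps every pair reaches recall 1 at sparsity 0, so the two numbers are only informative when reported together.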

Abstract

Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for detailed comparison between different models along their…
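To make "built-in exact sparse attention" concrete, here is a minimal sketch using sparsemax, the α = 2 member of the entmax family. The entmax variant used in the paper may differ, and the helper names `sparsemax` and `attention_support` are hypothetical. The sketch shows that some attention weights come out exactly zero, and that finding out which ones still requires scoring every query against every key, which is the quadratic step a sparsity predictor would try to avoid.

```python
def sparsemax(scores):
    """Sparsemax: Euclidean projection of the scores onto the simplex.

    Returns a probability vector that can contain exact zeros.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        # The k-th largest score is in the support iff 1 + k * z_k > sum of top-k.
        if 1 + k * zk > cumsum:
            tau = (cumsum - 1) / k  # threshold for the largest valid k so far
    return [max(s - tau, 0.0) for s in scores]


def attention_support(queries, keys):
    """Return the 0/1 sparsity pattern of sparsemax attention.

    Every query is scored against every key, so the cost is quadratic
    in the sequence length even though the result is sparse.
    """
    pattern = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        probs = sparsemax(scores)
        pattern.append([1 if p > 0 else 0 for p in probs])
    return pattern


probs = sparsemax([1.0, 0.8, 0.1, -1.0])
pattern = attention_support(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 0.0], [0.8, 0.0], [0.1, 0.0], [-1.0, 0.0]],
)
print(probs, pattern)
```

The pattern returned by `attention_support` plays the role of the ground-truth graph that a predictor in the spirit of Sparsefinder would try to recover without computing all the scores.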

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Topic Modeling