Predicting Attention Sparsity in Transformers
Marcos Treviso, António Góis, Patrick Fernandes, Erick Fonseca, André F. T. Martins

TL;DR
Sparsefinder predicts the sparsity pattern of entmax attention in transformers, enabling more efficient computation by reducing complexity while maintaining accuracy, and provides a detailed analysis of sparsity-recall tradeoffs.
Contribution
We introduce Sparsefinder, a novel model that predicts attention sparsity patterns in transformers, facilitating efficient sparse attention computation.
Findings
Sparsefinder effectively predicts attention sparsity patterns.
Our methods achieve favorable tradeoffs between sparsity and recall.
Extensive analysis guides future benchmarks for sparse attention models.
Abstract
Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for detailed comparison between different models along their…
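The sparsity-recall tradeoff mentioned in the abstract can be made concrete with a small sketch. This is an illustration under stated assumptions, not the authors' code: sparsemax (the alpha = 2 member of the entmax family) stands in for entmax attention, "sparsity" is taken to mean the fraction of query-key pairs the predicted graph drops, and "recall" the fraction of truly nonzero attention entries that the predicted graph keeps. The function names and the toy predicted mask are hypothetical.

```python
def sparsemax(scores):
    """Sparsemax over one row of attention scores (entmax with alpha = 2).

    Returns a probability vector that may contain exact zeros.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The support condition holds for a prefix of the sorted scores,
        # so the last j that satisfies it gives the threshold tau.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


def sparsity_and_recall(gold_mask, pred_mask):
    """Compare a predicted attention graph with the gold sparsity pattern.

    Both masks are lists of rows of booleans (True = edge kept).
    Assumed definitions: sparsity = share of pairs the prediction drops,
    recall = share of gold edges the prediction keeps.
    """
    total = sum(len(row) for row in pred_mask)
    kept = sum(1 for row in pred_mask for p in row if p)
    gold_edges = sum(1 for row in gold_mask for g in row if g)
    hits = sum(
        1
        for grow, prow in zip(gold_mask, pred_mask)
        for g, p in zip(grow, prow)
        if g and p
    )
    sparsity = 1.0 - kept / total
    recall = hits / gold_edges if gold_edges else 1.0
    return sparsity, recall


# Toy attention scores: 2 queries x 3 keys (made-up numbers).
scores = [[2.0, 1.0, -1.0],
          [1.0, 0.8, 0.1]]
attention = [sparsemax(row) for row in scores]
gold_mask = [[p > 0.0 for p in row] for row in attention]

# Hypothetical output of a sparsity predictor for the same queries and keys.
pred_mask = [[True, True, False],
             [True, False, False]]

sparsity, recall = sparsity_and_recall(gold_mask, pred_mask)
print(f"sparsity={sparsity:.3f} recall={recall:.3f}")
```

Under these definitions, a predictor that keeps every pair reaches perfect recall at zero sparsity, while a predictor that drops more pairs gains sparsity at the risk of missing nonzero entries. That is the tradeoff the abstract refers to.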
Taxonomy
Topics: Parallel Computing and Optimization Techniques · Ferroelectric and Negative Capacitance Devices · Topic Modeling
