TL;DR
This paper challenges the claim that attention weights in transformer models do not explain model predictions. It introduces efficient attention, which isolates the effective components of attention matrices, and shows that these components play a causal role in NLP tasks that require contextual information.
Contribution
The paper shows that prior formal arguments against the explanatory relevance of attention weights are incorrect. It also introduces efficient attention, which isolates the effective components of attention matrices and can be effectively computed.
Findings
Efficient attention matrices are probability distributions.
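Efficient attention has a causal role in model predictions: it provides minimally necessary and sufficient conditions for predicting model output in NLP tasks requiring contextual information.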
Empirical results support the effectiveness of efficient attention across datasets.
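As a rough illustration of what it means for an attention matrix to be a probability distribution, the sketch below checks that each row is non-negative and sums to 1. It uses standard softmax attention, not the paper's efficient attention, whose construction is not given in this summary. The names `softmax_attention` and `is_row_stochastic` are hypothetical helpers introduced only for this example.

```python
import numpy as np

def softmax_attention(q, k):
    """Standard scaled dot-product attention weights.

    Assumption: this is ordinary softmax attention, used here only as a
    stand-in; it is NOT the paper's efficient attention.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def is_row_stochastic(m, tol=1e-8):
    """True if every row is non-negative and sums to 1 (within tolerance)."""
    non_negative = np.all(m >= -tol)
    rows_sum_to_one = np.allclose(m.sum(axis=-1), 1.0, atol=tol)
    return bool(non_negative and rows_sum_to_one)

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # 4 query tokens, dimension 8 (arbitrary sizes)
k = rng.normal(size=(4, 8))  # 4 key tokens, dimension 8

attn = softmax_attention(q, k)
print(is_row_stochastic(attn))

# Dropping entries without renormalizing generally breaks the property,
# so a component extracted from an attention matrix is not automatically
# a probability distribution.
truncated = attn[:, :2]
print(is_row_stochastic(truncated))

# Renormalizing each row restores it.
renormalized = truncated / truncated.sum(axis=-1, keepdims=True)
print(is_row_stochastic(renormalized))
```

The check is deliberately generic: it could be applied to any candidate matrix, which is one way to test empirically whether a derived attention component still behaves like a distribution.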
Abstract
This paper explores the much discussed, possible explanatory link between attention weights (AW) in transformer models and predicted output. Contrary to intuition and early research on attention, more recent prior research has provided formal arguments and empirical evidence that AW are not explanatorily relevant. We show that the formal arguments are incorrect. We introduce and effectively compute efficient attention, which isolates the effective components of attention matrices in tasks and models in which AW play an explanatory role. We show that efficient attention has a causal role (provides minimally necessary and sufficient conditions) for predicting model output in NLP tasks requiring contextual information, and we show, contrary to [7], that efficient attention matrices are probability distributions and are effectively calculable. Thus, they should play an important part in the…
Taxonomy
Methods: Softmax · Attention Is All You Need
