Attention is not Explanation
Sarthak Jain, Byron C. Wallace

TL;DR
This paper critically examines the common assumption that attention weights in neural NLP models serve as meaningful explanations, demonstrating through experiments that they do not reliably indicate feature importance or model reasoning.
Contribution
The study provides extensive empirical evidence, gathered across a variety of NLP tasks, that attention weights are not reliable explanations of model predictions, challenging their use as an interpretability tool in neural NLP models.
Findings
Attention weights are often uncorrelated with gradient-based importance measures.
Very different attention distributions can yield equivalent model predictions.
Standard attention modules therefore largely fail to provide meaningful explanations for model decisions.
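The finding about alternative attention distributions can be illustrated with a small toy example. The sketch below is not the authors' experimental setup: the per-token values, the two attention vectors, and the helper names (`predict`, `js_divergence`) are all hypothetical, chosen only to show how two attention distributions that are far apart can still lead to nearly the same output when the attended-to representations are similar.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(attn, values):
    """Toy model (an assumption, not the paper's architecture): each token
    already carries a scalar value, and the prediction is a sigmoid of the
    attention-weighted sum of those values."""
    context = sum(a * x for a, x in zip(attn, values))
    return 1.0 / (1.0 + math.exp(-context))

def js_divergence(p, q):
    """Jensen-Shannon divergence (in nats) between two distributions."""
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical per-token values that are nearly identical across positions.
values = [0.80, 0.81, 0.79, 0.80, 0.82, 0.78]

# "Original" attention focuses on the first token; the alternative focuses
# on the last token instead.
attn_original = softmax([4.0, 0.0, 0.0, 0.0, 0.0, 0.0])
attn_alternative = list(reversed(attn_original))

attention_gap = js_divergence(attn_original, attn_alternative)
prediction_gap = abs(predict(attn_original, values)
                     - predict(attn_alternative, values))

print(f"JS divergence between attention distributions: {attention_gap:.3f}")
print(f"Absolute difference in predictions:            {prediction_gap:.4f}")
```

In this toy setting the two attention vectors highlight entirely different tokens, yet the prediction barely moves, so a heatmap of either one would tell a different "story" about the same output.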
Abstract
Attention mechanisms have seen wide adoption in neural NLP models. In addition to improving predictive performance, these are often touted as affording transparency: models equipped with attention provide a distribution over attended-to input units, and this is often presented (at least implicitly) as communicating the relative importance of inputs. However, it is unclear what relationship exists between attention weights and model outputs. In this work, we perform extensive experiments across a variety of NLP tasks that aim to assess the degree to which attention weights provide meaningful 'explanations' for predictions. We find that they largely do not. For example, learned attention weights are frequently uncorrelated with gradient-based measures of feature importance, and one can identify very different attention distributions that nonetheless yield equivalent predictions. Our…
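To make the comparison between attention weights and gradient-based feature importance concrete, here is a minimal sketch under stated assumptions. It uses a hypothetical single-layer attention model with random parameters (`u`, `v`), central finite differences in place of automatic differentiation, and Kendall's tau as one possible rank-correlation measure. None of these choices are taken from the text above; they only show the shape of such a check.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def model(tokens, u, v):
    """Toy attention model (hypothetical): score each token with u,
    take the attention-weighted average, then apply a sigmoid over v."""
    attn = softmax([dot(u, x) for x in tokens])
    dim = len(v)
    context = [sum(a * x[j] for a, x in zip(attn, tokens)) for j in range(dim)]
    y = 1.0 / (1.0 + math.exp(-dot(v, context)))
    return y, attn

def gradient_importance(tokens, u, v, h=1e-5):
    """Per-token importance: L1 norm of d(output)/d(token features),
    estimated with central finite differences."""
    importance = []
    for i in range(len(tokens)):
        total = 0.0
        for j in range(len(tokens[i])):
            plus = [list(x) for x in tokens]
            minus = [list(x) for x in tokens]
            plus[i][j] += h
            minus[i][j] -= h
            grad = (model(plus, u, v)[0] - model(minus, u, v)[0]) / (2 * h)
            total += abs(grad)
        importance.append(total)
    return importance

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

random.seed(0)
num_tokens, dim = 8, 3
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
u = [random.gauss(0, 1) for _ in range(dim)]  # attention scoring vector
v = [random.gauss(0, 1) for _ in range(dim)]  # output weight vector

y, attn = model(tokens, u, v)
importance = gradient_importance(tokens, u, v)
tau = kendall_tau(attn, importance)
print(f"Kendall tau between attention and gradient importance: {tau:.3f}")
```

A tau near 1 would mean attention ranks tokens the same way the gradients do; values near 0 indicate the two rankings are unrelated. With random parameters the result says nothing about trained models, so the number printed here should be read only as a demonstration of the procedure.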
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
