# Is Attention Interpretable?

**Authors:** Sofia Serrano, Noah A. Smith

arXiv: 1906.03731 · 2019-06-11

## TL;DR

This paper tests whether attention weights in trained text classification models reflect how much each input component matters to the model's prediction. It finds that attention is only a noisy predictor of importance, not a fail-safe indicator.

## Contribution

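It provides empirical evidence, obtained by manipulating attention weights in already-trained text classification models and analyzing the resulting changes in predictions, that attention weights are not consistently reliable indicators of input importance.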

## Key findings

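- Higher attention weights correlate with greater impact on model predictions in some settings, but not in many others.
- Gradient-based rankings of attention weights often predict their effects better than rankings by magnitude.
- Attention noisily predicts the overall importance of input components; it is not a fail-safe indicator.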
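
To make the contrast between magnitude-based and gradient-based rankings concrete, here is a minimal sketch on a toy model. Everything in it is an assumption for illustration and is not taken from the paper: the single attention layer followed by a linear classifier, the random weights, and the helper names `rank_by_attention` and `rank_by_gradient`. Because the toy model is linear in the attention weights, the gradient of the predicted class's logit with respect to each weight has a closed form.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

def rank_by_attention(attn):
    """Token indices ordered from highest to lowest attention weight."""
    return np.argsort(-attn)

def rank_by_gradient(hidden, attn, W):
    """Token indices ordered by d(predicted-class logit) / d(attn_i).

    For the toy model logits = W @ (attn @ hidden), this derivative is
    W[c] @ hidden[i], where c is the predicted class.
    """
    logits = W @ (attn @ hidden)
    c = int(np.argmax(logits))
    grads = hidden @ W[c]
    return np.argsort(-grads), grads

# Hypothetical toy inputs: 6 tokens, 4-dimensional representations, 3 classes.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))     # contextualized token representations
attn = softmax(rng.normal(size=6))   # attention weights over the tokens
W = rng.normal(size=(3, 4))          # linear output layer

attn_order = rank_by_attention(attn)
grad_order, grads = rank_by_gradient(hidden, attn, W)
print("ranking by attention magnitude:", attn_order.tolist())
print("ranking by gradient:           ", grad_order.tolist())
```

The two orderings need not agree, which is the kind of mismatch the second finding refers to. In a real model the gradient would come from automatic differentiation rather than a closed form.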

## Abstract

Attention mechanisms have recently boosted performance on a range of NLP tasks. Because attention layers explicitly weight input components' representations, it is also often assumed that attention can be used to identify information that models found important (e.g., specific contextualized word tokens). We test whether that assumption holds by manipulating attention weights in already-trained text classification models and analyzing the resulting differences in their predictions. While we observe some ways in which higher attention weights correlate with greater impact on model predictions, we also find many ways in which this does not hold, i.e., where gradient-based rankings of attention weights better predict their effects than their magnitudes. We conclude that while attention noisily predicts input components' overall importance to a model, it is by no means a fail-safe indicator.
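
The manipulation described in the abstract can be sketched as an erasure test: remove one attention weight from an already-computed attention distribution, rerun the rest of the model, and measure how far the output distribution moves. The sketch below is illustrative only. The toy classifier, the renormalization after zeroing, the Jensen-Shannon divergence as the distance measure, the random-token comparison, and all function names are assumptions of this example, not details stated in this summary.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()

def predict(hidden, attn, W):
    """Toy classifier: attention-weighted sum of token vectors, then a linear layer."""
    return softmax(W @ (attn @ hidden))

def erase_and_renormalize(attn, index):
    """Zero one attention weight and rescale the rest to sum to 1.

    Assumes at least one other weight is nonzero.
    """
    erased = attn.copy()
    erased[index] = 0.0
    return erased / erased.sum()

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between strictly positive distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * np.log(p / m))
    kl_qm = np.sum(q * np.log(q / m))
    return 0.5 * (kl_pm + kl_qm)

# Hypothetical toy inputs: 6 tokens, 4-dimensional representations, 3 classes.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(6, 4))
attn = softmax(rng.normal(size=6))
W = rng.normal(size=(3, 4))

baseline = predict(hidden, attn, W)
top = int(np.argmax(attn))                                    # most-attended token
others = [i for i in range(len(attn)) if i != top]
rand = int(rng.choice(others))                                # a random other token

js_top = js_divergence(baseline, predict(hidden, erase_and_renormalize(attn, top), W))
js_rand = js_divergence(baseline, predict(hidden, erase_and_renormalize(attn, rand), W))
print(f"JS divergence after erasing top-attention token: {js_top:.4f}")
print(f"JS divergence after erasing a random token:      {js_rand:.4f}")
```

If attention were a fail-safe importance indicator, erasing the most-attended token would reliably shift the prediction more than erasing a random one. On any single toy example the comparison can go either way, so a meaningful test aggregates over many inputs.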

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.03731/full.md

## Figures

43 figures with captions in the complete paper: https://tomesphere.com/paper/1906.03731/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1906.03731/full.md

---
Source: https://tomesphere.com/paper/1906.03731