TL;DR
This paper investigates the identifiability of attention weights in Transformers for text classification. It reveals the hidden role of key vectors, proposes an encoder variant with more identifiable weights, and validates that variant empirically.
Contribution
It provides a deeper theoretical analysis of attention weight identifiability, uncovers the role of key vectors, and introduces a new encoder variant for more interpretable attention.
Findings
Attention weights are more identifiable than previously thought.
The proposed encoder variant yields attention weights that remain identifiable up to the input length.
Empirical results validate the effectiveness of the new variant on text classification tasks.
Abstract
Interpretability is an important aspect of the trustworthiness of a model's predictions. A Transformer's predictions are widely explained by its attention weights, i.e., a probability distribution generated at its self-attention unit (head). Current empirical studies provide shreds of evidence that attention weights are not explanations by showing that they are not unique. A recent study provided theoretical justification for this observation by proving the non-identifiability of attention weights. For a given input to a head and its output, if the attention weights generated in it are unique, we call the weights identifiable. In this work, we provide a deeper theoretical analysis and empirical observations on the identifiability of attention weights. Ignored in the previous works, we find the attention weights are more identifiable than we currently perceive by uncovering the hidden role of…
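The definition of identifiability in the abstract can be made concrete with a small numerical sketch. The NumPy code below is an illustration only, not the paper's method or its proposed encoder variant. It assumes a single head whose output is the product of an attention matrix A and a value matrix V, and it does not model the role of key vectors that the paper analyzes. The function name alternative_attention and the parameter eps are hypothetical. The idea: if some nonzero vector d satisfies d @ V = 0 and sums to zero, then adding a small multiple of d to every row of A gives different weights that are still valid probability distributions and produce the same output, so the weights are not unique.

```python
import numpy as np


def alternative_attention(A, V, eps=0.1):
    """Return attention weights different from A with the same product A @ V.

    Returns None when no such weights are found along this construction,
    i.e. when the weights look identifiable for this simplified head.
    """
    n = V.shape[0]
    # A direction d must satisfy d @ V = 0 (output unchanged) and
    # d @ 1 = 0 (rows still sum to 1): d lies in the left null space of [V | 1].
    M = np.hstack([V, np.ones((n, 1))])
    U, s, _ = np.linalg.svd(M, full_matrices=True)
    rank = int(np.sum(s > 1e-10))
    null_basis = U[:, rank:]  # columns span the left null space of M
    if null_basis.shape[1] == 0:
        return None
    d = null_basis[:, 0]
    A_alt = A + eps * d  # the same perturbation is added to every row
    if np.any(A_alt < 0):
        return None  # perturbation too large to stay a probability distribution
    return A_alt


rng = np.random.default_rng(0)

# Sequence length 4, value dimension 2: the null space is non-trivial.
V_long = rng.normal(size=(4, 2))
A_long = np.full((4, 4), 0.25)
A_other = alternative_attention(A_long, V_long)
print(np.allclose(A_long @ V_long, A_other @ V_long))

# Sequence length 3, value dimension 2: [V | 1] is square and generically
# full rank, so this construction finds no alternative weights.
V_short = rng.normal(size=(3, 2))
A_short = np.full((3, 3), 1.0 / 3.0)
print(alternative_attention(A_short, V_short))
```

In this simplified setting, uniqueness depends on how the sequence length compares with the value dimension: longer inputs leave room for alternative weights, shorter ones do not.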
