Effective Attention Sheds Light On Interpretability
Kaiser Sun, Ana Marasović

TL;DR
This paper compares effective attention, the component of a transformer self-attention matrix that actually contributes to the model output, with standard attention, and shows that the two lead to different interpretations of BERT on a subset of the GLUE tasks.
Contribution
The paper compares interpretations of standard and effective attention for BERT on a subset of the GLUE tasks and recommends effective attention for studying transformer behavior, since it is more pertinent to the model output by design.
Findings
Interpretations drawn from effective attention differ from those drawn from standard attention.
Effective attention is less associated with features tied to language modeling pretraining, such as the separator token.
Effective attention has more potential to illustrate the linguistic features the model captures for solving the end-task.
Abstract
An attention matrix of a transformer self-attention sublayer can provably be decomposed into two components and only one of them (effective attention) contributes to the model output. This leads us to ask whether visualizing effective attention gives different conclusions than interpretation of standard attention. Using a subset of the GLUE tasks and BERT, we carry out an analysis to compare the two attention matrices, and show that their interpretations differ. Effective attention is less associated with the features related to the language modeling pretraining such as the separator token, and it has more potential to illustrate linguistic features captured by the model for solving the end-task. Given the found differences, we recommend using effective attention for studying a transformer's behavior since it is more pertinent to the model output by design.
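The decomposition described in the abstract can be illustrated with a small numerical sketch. It assumes, as an illustration rather than a statement of the authors' implementation, that the effective part is obtained by projecting each attention row onto the column space of the matrix the attention is multiplied with, so that the remainder lies in that matrix's left null space and cancels in the product. The names `effective_attention`, `attn`, and `values` are hypothetical, and the random inputs stand in for real BERT activations.

```python
import numpy as np

def effective_attention(attn, values):
    """Split `attn` into a part that affects `attn @ values` and a part that does not.

    attn:   (seq_len, seq_len) row-stochastic attention matrix.
    values: (seq_len, d_v) matrix multiplied by the attention (hypothetical name).
    """
    # Orthogonal projector onto the column space of `values`.
    proj = values @ np.linalg.pinv(values)
    # Rows of `effective` lie in the column space of `values`.
    effective = attn @ proj
    # Rows of `null_part` lie in the left null space of `values`.
    null_part = attn - effective
    return effective, null_part

# Toy inputs (assumption): random logits and values, not taken from a trained model.
rng = np.random.default_rng(0)
seq_len, d_v = 8, 4  # seq_len > d_v, so the left null space is non-trivial
logits = rng.normal(size=(seq_len, seq_len))
attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
values = rng.normal(size=(seq_len, d_v))

effective, null_part = effective_attention(attn, values)

# The output is unchanged when standard attention is replaced by the effective part.
print(np.allclose(attn @ values, effective @ values))
# The discarded part contributes nothing to the output.
print(np.allclose(null_part @ values, 0.0))
```

In this sketch the two matrices differ only when the sequence length exceeds the rank of `values`; otherwise the projector is the identity and the effective part equals the standard attention.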
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Layer Normalization · Softmax · Linear Warmup With Linear Decay · Attention Dropout · WordPiece · Weight Decay · Dropout
