Self-Attention Attribution: Interpreting Information Interactions Inside Transformer
Yaru Hao, Li Dong, Furu Wei, Ke Xu

TL;DR
This paper introduces a self-attention attribution method for interpreting information interactions inside Transformer models, using BERT as the case study. It scores attention heads and extracts hierarchical token dependencies to explain how input features interact to produce a prediction.
Contribution
It proposes a self-attention attribution technique that identifies important attention heads, builds attribution trees that expose hierarchical interactions, and shows that attribution patterns can be used to construct adversarial attacks.
Findings
Attention heads that attribution marks as unimportant can be pruned with marginal performance degradation.
Attribution trees, built from the most salient dependencies in each layer, reveal hierarchical interactions inside the Transformer.
Attribution patterns can be used to generate adversarial attacks.
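The pruning finding can be illustrated with a minimal sketch. It assumes each head already has a scalar importance score; how that score is derived from attribution is not specified in this summary, so the scores, the (layer, head) keys, and the function name split_heads_by_importance are all hypothetical.

```python
def split_heads_by_importance(importance, num_keep):
    """Rank heads by score and split them into kept and pruned groups.

    importance: dict mapping (layer, head) -> importance score (hypothetical values).
    num_keep:   number of top-scoring heads to keep.
    """
    # Sort head identifiers from most to least important.
    ranked = sorted(importance, key=importance.get, reverse=True)
    return ranked[:num_keep], ranked[num_keep:]


# Made-up scores for a toy model with 2 layers and 2 heads per layer.
scores = {(0, 0): 0.02, (0, 1): 0.40, (1, 0): 0.31, (1, 1): 0.01}
kept, pruned = split_heads_by_importance(scores, num_keep=2)
print("kept:", kept)
print("pruned:", pruned)
```

In a real experiment the pruned heads would be masked and task accuracy re-measured to check that degradation stays marginal.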
Abstract
The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but they fail to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution method to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. Firstly, we apply self-attention attribution to identify the important attention heads, while others can be pruned with marginal performance degradation. Furthermore, we extract the most salient dependencies in each layer to construct an attribution tree, which reveals the hierarchical interactions inside…
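The abstract does not spell out the attribution formula. A minimal sketch, assuming an integrated-gradients-style formulation in which the attention matrix A is scaled from zero to its actual value and the gradients of the model output are accumulated along the way, might look like the following. The toy model, its weights W, and the helper names are illustrative assumptions, and gradients are taken by finite differences so the example needs no autodiff library.

```python
import math


def numerical_grad(f, A, eps=1e-5):
    """Central-difference gradient of scalar f with respect to each entry of A."""
    grad = [[0.0] * len(A[0]) for _ in A]
    for i in range(len(A)):
        for j in range(len(A[0])):
            orig = A[i][j]
            A[i][j] = orig + eps
            up = f(A)
            A[i][j] = orig - eps
            down = f(A)
            A[i][j] = orig  # restore the entry before moving on
            grad[i][j] = (up - down) / (2 * eps)
    return grad


def attention_attribution(f, A, steps=20):
    """Approximate A * integral_0^1 dF(alpha * A)/dA d(alpha) with a Riemann sum."""
    total = [[0.0] * len(A[0]) for _ in A]
    for k in range(1, steps + 1):
        alpha = k / steps
        scaled = [[alpha * a for a in row] for row in A]
        g = numerical_grad(f, scaled)
        for i in range(len(A)):
            for j in range(len(A[0])):
                total[i][j] += g[i][j]
    # Elementwise product of A with the averaged gradient.
    return [[A[i][j] * total[i][j] / steps for j in range(len(A[0]))]
            for i in range(len(A))]


# Toy stand-in for a model: a sigmoid over a weighted sum of attention entries.
W = [[1.0, -0.5], [0.0, 2.0]]


def toy_model(A):
    z = sum(A[i][j] * W[i][j] for i in range(2) for j in range(2))
    return 1.0 / (1.0 + math.exp(-z))


A = [[0.9, 0.1], [0.3, 0.7]]
attr = attention_attribution(toy_model, A)
for row in attr:
    print(row)
```

Entries with large positive attribution mark the token-to-token connections that push the output up the most. As with integrated gradients in general, the attributions sum approximately to the difference between the output at A and the output at an all-zero attention matrix.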
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Transformer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay
