Self-Attention Attribution: Interpreting Information Interactions Inside Transformer

Yaru Hao; Li Dong; Furu Wei; Ke Xu

arXiv:2004.11207 · cs.CL · February 26, 2021 · 23 cites

TL;DR

This paper introduces self-attention attribution, a method for interpreting how information interacts inside Transformer models, using BERT as the case study. It scores attention heads and token-to-token dependencies to explain how the model reaches its decisions.

Contribution

It proposes a self-attention attribution technique that identifies important attention heads, constructs attribution trees that expose hierarchical token interactions, and shows that the extracted attribution patterns can be used for adversarial attacks.
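Identifying important heads, as described above, needs a per-head score. As we understand the paper, a head's importance is the largest value in its attribution matrix, averaged over examples, and the lowest-scoring heads are the pruning candidates. The sketch below assumes the attribution matrices are already computed; the values are made-up toy numbers, and `head_importance` and `heads_to_prune` are hypothetical names, not the authors' code.

```python
from statistics import mean

def head_importance(attributions):
    """Score each head by the max entry of its attribution matrix, averaged over examples.

    attributions[example][head] is an n x n list of attribution scores (hypothetical layout).
    """
    num_heads = len(attributions[0])
    scores = []
    for h in range(num_heads):
        per_example_max = [max(max(row) for row in example[h]) for example in attributions]
        scores.append(mean(per_example_max))
    return scores

def heads_to_prune(scores, keep):
    """Return the indices of heads outside the top-`keep` by importance (pruning candidates)."""
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[keep:])

# Two examples, three heads, 2x2 attribution matrices (illustrative values only).
attributions = [
    [[[0.9, 0.1], [0.0, 0.2]], [[0.05, 0.02], [0.01, 0.03]], [[0.3, 0.4], [0.1, 0.0]]],
    [[[0.7, 0.2], [0.1, 0.1]], [[0.04, 0.01], [0.02, 0.06]], [[0.5, 0.2], [0.2, 0.1]]],
]
scores = head_importance(attributions)
print([round(s, 3) for s in scores])
print(heads_to_prune(scores, keep=2))
```

Here head 1 has uniformly small attributions, so it is the one flagged for pruning when only two heads are kept.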

Findings

01

Attention heads with low attribution scores can be pruned with minimal performance loss, leaving the important heads in place.

02

Attribution trees, built from the most salient dependencies in each layer, reveal the hierarchical interactions inside Transformer.

03

Attribution results can be used as adversarial patterns to mount non-targeted attacks on BERT.
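The abstract describes extracting the most salient dependencies in each layer to construct an attribution tree (Finding 02). The sketch below shows only that extraction step under stated assumptions: it keeps every dependency whose attribution reaches a fraction `tau` of that layer's maximum. The relative threshold, the `tau` value, the input layout, and the name `salient_edges` are our assumptions, and linking the kept edges into a tree across layers is not shown.

```python
def salient_edges(layer_attr, tau=0.4):
    """Return (layer, i, j, score) for every entry scoring at least tau * that layer's max.

    layer_attr[l][i][j] is the attribution of the dependency in which token i
    attends to token j at layer l (hypothetical input layout).
    """
    edges = []
    for layer, matrix in enumerate(layer_attr):
        peak = max(max(row) for row in matrix)
        if peak <= 0:
            continue  # no positive attribution, so nothing salient in this layer
        for i, row in enumerate(matrix):
            for j, score in enumerate(row):
                if score >= tau * peak:
                    edges.append((layer, i, j, score))
    return edges

tokens = ["[CLS]", "good", "movie"]
layer_attr = [
    [[0.1, 0.2, 0.0], [0.0, 0.1, 0.9], [0.0, 0.3, 0.1]],  # layer 0 (toy values)
    [[0.2, 0.8, 0.5], [0.0, 0.1, 0.0], [0.1, 0.0, 0.1]],  # layer 1 (toy values)
]
edges = salient_edges(layer_attr, tau=0.4)
for layer, i, j, score in edges:
    print(f"layer {layer}: {tokens[i]} <- {tokens[j]} ({score})")
```

A relative threshold keeps the extraction comparable across layers whose attribution magnitudes differ; an absolute cutoff would be an equally simple alternative.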

Abstract

The great success of Transformer-based models benefits from the powerful multi-head self-attention mechanism, which learns token dependencies and encodes contextual information from the input. Prior work strives to attribute model decisions to individual input features with different saliency measures, but it fails to explain how these input features interact with each other to reach predictions. In this paper, we propose a self-attention attribution method to interpret the information interactions inside Transformer. We take BERT as an example to conduct extensive studies. Firstly, we apply self-attention attribution to identify the important attention heads, while others can be pruned with marginal performance degradation. Furthermore, we extract the most salient dependencies in each layer to construct an attribution tree, which reveals the hierarchical interactions inside…
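The abstract does not spell out how the attribution is computed. As we understand the paper, it applies integrated gradients to a head's attention matrix A: Attr(A) = A ⊙ ∫₀¹ ∂F(αA)/∂A dα, approximated by a Riemann sum over m steps (α = k/m, k = 1..m). The sketch below uses a hypothetical toy output F, the tanh of an attention-weighted sum of per-token scores `s`, so the gradient can be written by hand; in BERT, F would be the model's output probability and the gradient would come from autodiff. `toy_output`, `toy_grad`, `attention_attribution`, and `s` are illustrative names, not the authors' code.

```python
import math

def toy_output(A, s):
    """Hypothetical scalar 'model': tanh of the attention-weighted sum of token scores s."""
    n = len(A)
    z = sum(A[i][j] * s[j] for i in range(n) for j in range(n))
    return math.tanh(z)

def toy_grad(A, s):
    """Analytic gradient of toy_output: dF/dA_ij = (1 - tanh(z)^2) * s_j."""
    n = len(A)
    z = sum(A[i][j] * s[j] for i in range(n) for j in range(n))
    d = 1.0 - math.tanh(z) ** 2
    return [[d * s[j] for j in range(n)] for _ in range(n)]

def attention_attribution(A, s, m=20):
    """Integrated gradients on A: A * (1/m) * sum_{k=1..m} grad F((k/m) * A)."""
    n = len(A)
    total = [[0.0] * n for _ in range(n)]
    for k in range(1, m + 1):
        scaled = [[(k / m) * a for a in row] for row in A]  # interpolate from zero to A
        g = toy_grad(scaled, s)
        for i in range(n):
            for j in range(n):
                total[i][j] += g[i][j]
    # Element-wise product of A with the averaged gradient.
    return [[A[i][j] * total[i][j] / m for j in range(n)] for i in range(n)]

A = [[0.7, 0.3], [0.4, 0.6]]  # one head's attention probabilities (rows sum to 1)
s = [0.5, -1.0]               # toy per-token contribution scores
attr = attention_attribution(A, s, m=50)
print(sum(map(sum, attr)), toy_output(A, s))
```

Integrated gradients satisfies completeness, so the summed attributions should approximate F(A) − F(0). Here F(0) = tanh(0) = 0, so the two printed numbers should agree closely, and more closely as m grows.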

Code & Models

Videos

Self-Attention Attribution: Interpreting Information Interactions Inside Transformer · Underline

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Label Smoothing · Transformer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay