TL;DR
This paper introduces a unified statistical framework for analyzing multi-headed attention in BERT, enabling robust, comparable, and significance-aware classification of attention head roles across different contexts.
Contribution
It proposes a generalized score and hypothesis testing method for classifying attention head roles, addressing inconsistencies and statistical significance issues in prior approaches.
Findings
Identifies co-location of multiple roles within the same attention head
Analyzes distribution of attention heads across BERT layers
Examines impact of fine-tuning on attention head roles
Abstract
Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS] and [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores across studies or across functional roles, and (b) these scores are often average quantities measured across sentences without capturing statistical significance. In this work, we formalize a simple yet effective score that generalizes to all the roles of attention heads and employ hypothesis testing on this score for robust inference. This provides us the right lens to systematically…
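The abstract does not spell out the score or the test, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes a hypothetical score (the fraction of a head's attention mass that falls on token pairs satisfying a role's relation, encoded as a boolean mask) and swaps in a simple one-sided sign test across sentences. The function names, the 0.5 threshold, and the 0.05 significance level are all assumptions made for the example.

```python
from math import comb

def mask_score(attention, mask):
    """Fraction of a head's attention mass on token pairs allowed by `mask`.

    Hypothetical score for illustration; `attention` and `mask` are square
    nested lists of the same size for one sentence.
    """
    total = sum(sum(row) for row in attention)
    hit = sum(a for a_row, m_row in zip(attention, mask)
              for a, m in zip(a_row, m_row) if m)
    return hit / total

def local_mask(n, window=1):
    """Assumed encoding of the 'local' role: tokens at most `window` apart."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def sign_test_pvalue(scores, threshold):
    """One-sided sign test. H0: the median per-sentence score is <= threshold."""
    n = len(scores)
    k = sum(s > threshold for s in scores)  # sentences exceeding the threshold
    # P(X >= k) for X ~ Binomial(n, 0.5)
    return sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n

def head_has_role(attentions, masks, threshold=0.5, alpha=0.05):
    """Assign the role only if the per-sentence scores are significantly high."""
    scores = [mask_score(a, m) for a, m in zip(attentions, masks)]
    p_value = sign_test_pvalue(scores, threshold)
    return p_value < alpha, p_value

# Toy data: six 3-token "sentences" where the head attends mostly to itself.
peaked = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
attentions = [peaked] * 6
masks = [local_mask(3, window=0)] * 6
is_local, p_value = head_has_role(attentions, masks)
print(is_local, p_value)
```

Because the same mask-based score can be reused for any role by changing only the mask (syntactic, block, delimiter), a sketch like this makes scores comparable across roles, while the test across sentences replaces a bare average with a significance-aware decision.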
Taxonomy
Methods: Linear Layer · WordPiece · Attention Dropout · Residual Connection · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Linear Warmup With Linear Decay · Dropout
