The heads hypothesis: A unifying statistical approach towards understanding multi-headed attention in BERT

Madhura Pande, Aakriti Budhraja, Preksha Nema, Pratyush Kumar, and Mitesh M. Khapra

arXiv:2101.09115 · cs.CL · January 25, 2021

TL;DR

This paper introduces a unified statistical framework for analyzing multi-headed attention in BERT, enabling robust, comparable, and significance-aware classification of attention head roles across different contexts.

Contribution

It proposes a generalized score and hypothesis testing method for classifying attention head roles, addressing inconsistencies and statistical significance issues in prior approaches.

Findings

1. Identifies co-location of multiple roles within the same attention head
2. Analyzes the distribution of attention heads across BERT layers
3. Examines the impact of fine-tuning on attention head roles

Abstract

Multi-headed attention heads are a mainstay in transformer-based models. Different methods have been proposed to classify the role of each attention head based on the relations between tokens which have high pair-wise attention. These roles include syntactic (tokens with some syntactic relation), local (nearby tokens), block (tokens in the same sentence) and delimiter (the special [CLS], [SEP] tokens). There are two main challenges with existing methods for classification: (a) there are no standard scores across studies or across functional roles, and (b) these scores are often average quantities measured across sentences without capturing statistical significance. In this work, we formalize a simple yet effective score that generalizes to all the roles of attention heads and employs hypothesis testing on this score for robust inference. This provides us the right lens to systematically…
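As a rough illustration of the idea in the abstract (a per-sentence score for one role, followed by a hypothesis test across sentences), here is a minimal Python sketch. It is not the paper's definition: the score (`local_score`), the threshold `tau`, the toy attention generator, and the choice of a one-sided sign test are all assumptions made for illustration, and the paper's actual score and test may differ.

```python
import math
import random


def local_score(attn, window=1, top_k=1):
    """Fraction of each token's top-k attended positions that lie within
    `window` tokens of it. `attn` is a square list-of-lists of weights.
    Hypothetical score for the 'local' role, not the paper's formula."""
    n = len(attn)
    hits = 0
    total = 0
    for i, row in enumerate(attn):
        top = sorted(range(n), key=lambda j: row[j], reverse=True)[:top_k]
        for j in top:
            total += 1
            if abs(i - j) <= window:
                hits += 1
    return hits / total


def sign_test_p_value(scores, tau):
    """One-sided sign test. H0: the median per-sentence score is <= tau.
    A small p-value suggests the head plays the role consistently.
    The sign test is a stand-in chosen because it needs only the stdlib."""
    n = sum(1 for s in scores if s != tau)  # ties are dropped
    k = sum(1 for s in scores if s > tau)
    if n == 0:
        return 1.0
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n


def toy_attention(n_tokens, local, rng):
    """Synthetic attention matrix (assumption: stands in for one BERT head
    on one sentence). If `local`, each token strongly attends to its
    right-hand neighbour."""
    rows = []
    for i in range(n_tokens):
        row = [rng.random() for _ in range(n_tokens)]
        if local:
            row[min(i + 1, n_tokens - 1)] += 5.0
        total = sum(row)
        rows.append([w / total for w in row])
    return rows


if __name__ == "__main__":
    rng = random.Random(0)
    tau = 0.5  # hypothetical threshold on the per-sentence score
    for name, is_local in [("local-like head", True), ("diffuse head", False)]:
        scores = [local_score(toy_attention(12, is_local, rng)) for _ in range(20)]
        print(name, "p-value:", sign_test_p_value(scores, tau))
```

The point of testing per-sentence scores rather than thresholding their mean is that a head is assigned a role only when the evidence is consistent across sentences, which is the concern the abstract raises about averaged scores.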

Code & Models

Repositories

iitmnlp/heads-hypothesis (official, tf)

Videos

The Heads Hypothesis: A Unifying Statistical Approach towards Understanding Multi-Headed Attention in BERT

Taxonomy

Methods: Linear Layer · WordPiece · Attention Dropout · Residual Connection · Layer Normalization · Dense Connections · Attention Is All You Need · Adam · Linear Warmup With Linear Decay · Dropout