Have Attention Heads in BERT Learned Constituency Grammar?
Ziyang Luo

TL;DR
This paper investigates whether attention heads in BERT and RoBERTa implicitly learn constituency grammar, analyzing how fine-tuning affects this ability and its relation to natural language understanding tasks.
Contribution
It provides a novel analysis of constituency grammar induction in attention heads and examines the impact of different fine-tuning tasks on this ability.
Findings
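Some attention heads induce certain constituency grammar types much better than baselines.
Fine-tuning on sentence meaning similarity (SMS) tasks decreases the average constituency grammar inducing (CGI) ability of upper layers.
Fine-tuning on natural language inference (NLI) tasks increases CGI ability.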
Abstract
With the success of pre-trained language models in recent years, more and more researchers focus on opening the "black box" of these models. Following this interest, we carry out a qualitative and quantitative analysis of constituency grammar in attention heads of BERT and RoBERTa. We employ the syntactic distance method to extract implicit constituency grammar from the attention weights of each head. Our results show that there exist heads that can induce some grammar types much better than baselines, suggesting that some heads act as a proxy for constituency grammar. We also analyze how attention heads' constituency grammar inducing (CGI) ability changes after fine-tuning with two kinds of tasks, including sentence meaning similarity (SMS) tasks and natural language inference (NLI) tasks. Our results suggest that SMS tasks decrease the average CGI ability of upper layers, while NLI…
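The syntactic distance idea mentioned in the abstract can be illustrated with a minimal sketch. This summary does not state the paper's exact distance measure or tree-building procedure, so everything below is an assumption: the distance between adjacent tokens is taken to be the Jensen-Shannon divergence between their attention rows in a single head, the tree is built by greedily splitting at the largest distance, and the function names (`js_divergence`, `syntactic_distances`, `build_tree`) and the toy attention matrix are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def syntactic_distances(attn):
    """Distance between each pair of adjacent tokens.

    `attn` is one head's attention matrix as a list of rows, where
    row i is token i's attention distribution over all tokens.
    Using JS divergence here is an assumption, not the paper's stated choice.
    """
    return [js_divergence(attn[i], attn[i + 1]) for i in range(len(attn) - 1)]


def build_tree(tokens, dists):
    """Recursively split at the largest adjacent distance (binary bracketing)."""
    if len(tokens) == 1:
        return tokens[0]
    k = max(range(len(dists)), key=dists.__getitem__)
    left = build_tree(tokens[:k + 1], dists[:k])
    right = build_tree(tokens[k + 1:], dists[k + 1:])
    return (left, right)


# Toy, hand-made attention weights (not taken from BERT or RoBERTa).
tokens = ["the", "cat", "sat", "down"]
attn = [
    [0.50, 0.40, 0.05, 0.05],
    [0.40, 0.50, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
    [0.05, 0.05, 0.40, 0.50],
]
tree = build_tree(tokens, syntactic_distances(attn))
print(tree)  # (('the', 'cat'), ('sat', 'down'))
```

In this toy case the first two tokens attend similarly to each other, as do the last two, so the largest distance falls between "cat" and "sat" and the sentence is split there first. With a real model, the attention rows would come from one head of one layer, and the induced brackets would be scored against gold constituency trees.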
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
Methods: Linear Layer · Weight Decay · Multi-Head Attention · Attention Dropout · Layer Normalization · WordPiece · Dense Connections · Adam · Linear Warmup With Linear Decay
