Do Attention Heads in BERT Track Syntactic Dependencies?
Phu Mon Htut, Jason Phang, Shikha Bordia, Samuel R. Bowman

TL;DR
This paper examines whether individual attention heads in BERT and RoBERTa implicitly encode syntactic dependency relations, finding some heads specialize in certain dependencies but no single head performs comprehensive parsing.
Contribution
It applies two methods, taking the maximum attention weight and computing the maximum spanning tree, to extract dependency relations from attention weights, and analyzes how fine-tuning affects these syntactic patterns in transformer models.
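As a rough illustration of reading dependency relations off attention weights, the sketch below pairs each token with the other token it attends to most and checks how many gold arcs are recovered, ignoring direction. The function names, the toy attention matrix, the choice to mask self-attention, and the recall-style score are all assumptions made for this example, not details confirmed by the paper.

```python
import numpy as np

def max_attention_relations(attn):
    """Pair each token i with the token j != i that gets i's largest weight.

    `attn` is a (seq_len, seq_len) attention matrix for one layer/head.
    Masking the diagonal (self-attention) is an assumption of this sketch.
    """
    attn = np.asarray(attn, dtype=float).copy()
    np.fill_diagonal(attn, -np.inf)
    return [(i, int(j)) for i, j in enumerate(attn.argmax(axis=1))]

def gold_arc_recall(pred_pairs, gold_heads):
    """Fraction of gold arcs found among the predicted pairs, ignoring direction.

    `gold_heads[i]` is the index of token i's head, or -1 for the root.
    This scoring function is hypothetical, for illustration only.
    """
    gold = {frozenset((dep, head)) for dep, head in enumerate(gold_heads) if head >= 0}
    pred = {frozenset(pair) for pair in pred_pairs}
    return len(gold & pred) / len(gold)

# Toy sentence "the dog barks": the -> dog, dog -> barks, barks is the root.
attn = [
    [0.1, 0.8, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.5, 0.2],
]
gold_heads = [1, 2, -1]

pairs = max_attention_relations(attn)
recall = gold_arc_recall(pairs, gold_heads)
print(pairs, recall)
```

In practice the attention matrix would come from one head of a pretrained model, and the score would be aggregated over a parsed corpus and compared against a baseline.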
Findings
For some UD relation types, individual attention heads recover the dependency significantly better than baselines.
Fine-tuning BERT on CoLA or MNLI does not substantially change these dependency patterns.
No single head performs holistic syntactic parsing better than baselines.
Abstract
We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods---taking the maximum attention weight and computing the maximum spanning tree---to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets---the syntax-oriented CoLA and the semantics-oriented MNLI---to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe…
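The maximum-spanning-tree extraction mentioned in the abstract can be sketched as follows. The abstract does not say which spanning-tree algorithm is used, so this example swaps in Prim's algorithm on a symmetrized attention matrix, giving an undirected tree. The symmetrization rule, the function name, and the toy data are assumptions for illustration.

```python
import numpy as np

def max_spanning_tree_edges(attn):
    """Undirected maximum spanning tree over tokens (Prim's algorithm).

    Assumption: the attention matrix is symmetrized by keeping the larger
    of the two directed weights for each token pair.
    """
    w = np.asarray(attn, dtype=float)
    w = np.maximum(w, w.T)
    n = w.shape[0]
    in_tree = [0]          # grow the tree from token 0 (arbitrary start)
    edges = set()
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j not in in_tree and (best is None or w[i, j] > best[0]):
                    best = (w[i, j], i, j)
        _, i, j = best
        edges.add(frozenset((i, j)))
        in_tree.append(j)
    return edges

# Toy sentence "she reads old books":
# she -> reads, reads is the root, old -> books, books -> reads.
attn = [
    [0.1, 0.7, 0.1, 0.1],
    [0.3, 0.1, 0.1, 0.5],
    [0.1, 0.2, 0.1, 0.6],
    [0.1, 0.4, 0.4, 0.1],
]
gold_heads = [1, -1, 3, 1]

tree = max_spanning_tree_edges(attn)
gold_edges = {frozenset((d, h)) for d, h in enumerate(gold_heads) if h >= 0}
recall = len(tree & gold_edges) / len(gold_edges)
print(sorted(tuple(sorted(e)) for e in tree), recall)
```

Unlike the per-token maximum, the spanning tree forces a single connected structure over the whole sentence, which makes it a closer analogue of a full parse. A directed variant would need a different algorithm and a choice of root.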
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Neurobiology of Language and Bilingualism
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · RoBERTa · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Byte Pair Encoding
