Do Transformers Parse while Predicting the Masked Word?
Haoyu Zhao, Abhishek Panigrahi, Rong Ge, Sanjeev Arora

TL;DR
This paper shows that moderate-sized masked language models such as BERT or RoBERTa can approximately execute the Inside-Outside algorithm for a PCFG of English, which offers an explanation for why pre-trained models encode parse structure in their embeddings.
Contribution
The paper shows that transformers of realistic size can be constructed explicitly to perform approximate parsing, and that masked language modeling implicitly biases models toward algorithms like Inside-Outside.
Findings
Transformers can approximate the Inside-Outside algorithm.
Masked language models capture parsing structures.
Explicit transformer constructions reach >70% F1 on constituency parsing.
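To make the F1 figure above concrete, here is a minimal sketch of how a bracketing F1 score can be computed for one sentence. It assumes unlabeled (start, end) spans; this summary does not say whether the paper scores labeled or unlabeled brackets or how it handles punctuation, so `bracket_f1` and the example spans are illustrative only.

```python
from collections import Counter


def bracket_f1(predicted, gold):
    """F1 between two collections of (start, end) constituent spans."""
    pred_counts = Counter(predicted)
    gold_counts = Counter(gold)
    # Multiset intersection: spans that appear in both parses.
    matched = sum((pred_counts & gold_counts).values())
    if matched == 0:
        return 0.0
    precision = matched / sum(pred_counts.values())
    recall = matched / sum(gold_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans for a five-word sentence (end index is exclusive).
gold = [(0, 5), (0, 2), (2, 5), (3, 5)]
pred = [(0, 5), (0, 2), (2, 5), (2, 4)]
score = bracket_f1(pred, gold)
```

Three of the four predicted spans match the gold parse here, so precision and recall are both 3/4.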
Abstract
Pre-trained language models have been shown to encode linguistic structures, e.g. dependency and constituency parse trees, in their embeddings while being trained on unsupervised loss functions like masked language modeling. Some doubts have been raised whether the models actually are doing parsing or only some computation weakly correlated with it. We study questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of doing parsing -- or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al, 1993]. We also…
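As a rough illustration of how a PCFG relates to masked-word prediction, the sketch below runs the inside pass (the bottom-up half of Inside-Outside) on a toy grammar in Chomsky normal form. The grammar, the function names, and the brute-force masked-word routine are all assumptions made for this example: the routine re-runs the inside pass for every candidate word instead of using the outside pass, and it is not the transformer construction or the English PCFG from the paper.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form.
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 1.0,
}
LEXICAL = {  # (parent, word) -> rule probability
    ("NP", "dogs"): 0.5,
    ("NP", "cats"): 0.5,
    ("V", "chase"): 1.0,
}


def inside(words):
    """Return chart[(i, j)][A] = P(nonterminal A derives words[i:j])."""
    n = len(words)
    chart = defaultdict(lambda: defaultdict(float))
    # Width-1 spans come from lexical rules.
    for i, w in enumerate(words):
        for (a, word), p in LEXICAL.items():
            if word == w:
                chart[(i, i + 1)][a] += p
    # Wider spans sum over split points and binary rules.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    chart[(i, j)][a] += p * chart[(i, k)][b] * chart[(k, j)][c]
    return chart


def sentence_prob(words):
    """Probability that the start symbol S generates the whole sentence."""
    return inside(words)[(0, len(words))]["S"]


def masked_word_distribution(words, position, vocab):
    """Brute force: score each candidate by the full-sentence probability."""
    scores = {}
    for w in vocab:
        filled = list(words)
        filled[position] = w
        scores[w] = sentence_prob(filled)
    total = sum(scores.values())
    if total == 0:
        return scores
    return {w: s / total for w, s in scores.items()}


vocab = ["dogs", "cats", "chase"]
dist = masked_word_distribution(["[MASK]", "chase", "cats"], 0, vocab)
```

Under this toy grammar only a noun phrase can fill the first slot, so the mass in `dist` is split between "dogs" and "cats". The outside pass would give the same conditional without re-parsing once per candidate.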
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods
Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Attention Dropout · WordPiece · RoBERTa · Linear Layer · Weight Decay
