Do Transformers Parse while Predicting the Masked Word?

Haoyu Zhao; Abhishek Panigrahi; Rong Ge; Sanjeev Arora

arXiv:2303.08117 · cs.CL · October 17, 2023 · 1 citation

TL;DR

This paper demonstrates that moderate-sized transformers can approximate the Inside-Outside algorithm for parsing and that masked language models inherently capture parsing structures, explaining their ability to encode linguistic information.

Contribution

It shows that transformers can explicitly perform approximate parsing and that masked language models are implicitly biased towards algorithms like Inside-Outside for parsing tasks.

Findings

01

Transformers can approximate the Inside-Outside algorithm.

02

Masked language models capture parsing structures.

03

Explicit transformer constructions reach >70% F1 on constituency parsing.
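
For readers unfamiliar with the Inside-Outside algorithm named in the findings, the following is a minimal sketch of the textbook dynamic program on a PCFG in Chomsky normal form. It is not the transformer construction from the paper. The toy grammar, the sentence, and the function names (`inside`, `outside`, `span_marginals`) are all hypothetical and chosen only for illustration. The sketch computes inside scores, outside scores, and the marginal probability that a nonterminal spans a given range of words.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (NOT the grammar used in the paper).
BINARY = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}                # P(A -> B C)
LEXICAL = {("NP", "she"): 0.5, ("NP", "fish"): 0.5, ("V", "eats"): 1.0}  # P(A -> w)


def inside(words):
    """beta[(i, j, A)] = probability that A derives words[i:j]."""
    n = len(words)
    beta = defaultdict(float)
    # Width-1 spans come from lexical rules.
    for i, w in enumerate(words):
        for (a, word), p in LEXICAL.items():
            if word == w:
                beta[(i, i + 1, a)] += p
    # Wider spans combine two adjacent narrower spans at split point k.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    beta[(i, j, a)] += p * beta[(i, k, b)] * beta[(k, j, c)]
    return beta


def outside(words, beta, root="S"):
    """alpha[(i, j, A)] = probability of everything outside words[i:j], given A spans it."""
    n = len(words)
    alpha = defaultdict(float)
    alpha[(0, n, root)] = 1.0
    # Visit parents from widest to narrowest so each parent score is final
    # before it is pushed down to its children.
    for width in range(n, 1, -1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (a, b, c), p in BINARY.items():
                    parent = alpha[(i, j, a)]
                    alpha[(i, k, b)] += p * parent * beta[(k, j, c)]
                    alpha[(k, j, c)] += p * parent * beta[(i, k, b)]
    return alpha


def span_marginals(words, root="S"):
    """Return (marginals, Z): P(A spans words[i:j] | sentence) and the sentence probability Z."""
    beta = inside(words)
    z = beta[(0, len(words), root)]
    if z == 0.0:
        return {}, 0.0  # the grammar cannot generate this sentence
    alpha = outside(words, beta, root)
    marginals = {}
    for key, b in list(beta.items()):
        score = alpha[key] * b / z
        if score > 0.0:
            marginals[key] = score
    return marginals, z
```

Calling `span_marginals(["she", "eats", "fish"])` returns the sentence probability together with a dictionary keyed by `(start, end, nonterminal)`. Because this toy grammar admits only one parse of that sentence, every span in the parse tree receives marginal probability 1; an ambiguous grammar would spread probability mass over competing spans.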

Abstract

Pre-trained language models have been shown to encode linguistic structures, e.g., dependency and constituency parse trees, in their embeddings while being trained on unsupervised loss functions like masked language modeling. Some doubts have been raised as to whether the models are actually doing parsing or only some computation weakly correlated with it. We study two questions: (a) Is it possible to explicitly describe transformers with realistic embedding dimension, number of heads, etc. that are capable of doing parsing -- or even approximate parsing? (b) Why do pre-trained models capture parsing structure? This paper takes a step toward answering these questions in the context of generative modeling with PCFGs. We show that masked language models like BERT or RoBERTa of moderate sizes can approximately execute the Inside-Outside algorithm for the English PCFG [Marcus et al, 1993]. We also…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Computational and Text Analysis Methods

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Attention Dropout · WordPiece · RoBERTa · Linear Layer · Weight Decay