Masked Language Modeling and the Distributional Hypothesis: Order Word Matters Pre-training for Little

Koustuv Sinha; Robin Jia; Dieuwke Hupkes; Joelle Pineau; Adina Williams; Douwe Kiela

arXiv:2104.06644·cs.CL·September 13, 2021·1 cites

Open Access

TL;DR

This paper argues that masked language models succeed on downstream tasks mainly because they capture higher-order word co-occurrence statistics: models pre-trained on sentences with randomly shuffled word order still reach high accuracy after fine-tuning, which challenges the view that their success comes from learning syntactic structure.

Contribution

It demonstrates that MLMs pre-trained on sentences with shuffled word order can still perform well on many downstream tasks after fine-tuning, which points to distributional statistics as the main driver of their success.

Findings

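01 MLMs pre-trained on shuffled sentences still achieve high accuracy after fine-tuning on many downstream tasks, including tasks designed to be hard for models that ignore word order

02 The shuffled-order models perform surprisingly well on some parametric syntactic probes, which may indicate deficiencies in how representations are tested for syntactic information

03 The results suggest that distributional information largely explains the success of MLM pre-training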

Abstract

A possible explanation for the impressive performance of masked language model (MLM) pre-training is that such models have learned to represent the syntactic structures prevalent in classical NLP pipelines. In this paper, we propose a different explanation: MLMs succeed on downstream tasks almost entirely due to their ability to model higher-order word co-occurrence statistics. To demonstrate this, we pre-train MLMs on sentences with randomly shuffled word order, and show that these models still achieve high accuracy after fine-tuning on many downstream tasks -- including on tasks specifically designed to be challenging for models that ignore word order. Our models perform surprisingly well according to some parametric syntactic probes, indicating possible deficiencies in how we test representations for syntactic information. Overall, our results show that purely distributional…
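The core manipulation in the abstract, shuffling word order while leaving the distributional signal in place, can be illustrated with a small sketch. This is not the authors' preprocessing code: whitespace tokenization, the toy corpus, the fixed seed, and the helper names `shuffle_words` and `cooccurrence_counts` are all assumptions made for illustration. It shows only that permuting the words inside each sentence leaves sentence-level co-occurrence counts unchanged.

```python
import random
from collections import Counter
from itertools import combinations


def shuffle_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its whitespace-separated tokens in random order."""
    tokens = sentence.split()
    rng.shuffle(tokens)
    return " ".join(tokens)


def cooccurrence_counts(sentences):
    """Count unordered pairs of distinct word types that share a sentence."""
    counts = Counter()
    for sentence in sentences:
        # Sorting the word types gives every pair a canonical order,
        # so the counts do not depend on where the words appear.
        types = sorted(set(sentence.split()))
        counts.update(combinations(types, 2))
    return counts


rng = random.Random(0)  # arbitrary fixed seed, for reproducibility only
corpus = [  # hypothetical toy corpus
    "the dog chased the cat",
    "the cat slept on the mat",
]
shuffled = [shuffle_words(s, rng) for s in corpus]

# Word order is permuted, but which words co-occur in a sentence is not.
assert cooccurrence_counts(shuffled) == cooccurrence_counts(corpus)
```

A model trained on `shuffled` therefore sees the same sentence-level co-occurrence statistics as one trained on `corpus`, while losing the positional cues that syntax depends on.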

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications