Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge

Ian Porada; Alessandro Sordoni; Jackie Chi Kit Cheung

arXiv:2112.08583 · cs.CL · December 17, 2021

Open Access

TL;DR

This study investigates whether masked language models such as BERT acquire commonsense knowledge through systematic inference or through surface-level patterns. Its results suggest that the knowledge comes primarily from co-occurrence patterns rather than from induced reasoning.

Contribution

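The paper introduces a method for testing whether pre-training induces systematic inference: verbalized knowledge is selectively injected into the minibatches of a BERT model during pre-training, and the model is then evaluated on inferences that the injected knowledge supports.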

Findings

01

Generalization to supported inferences does not improve over the course of pre-training.

02

Commonsense knowledge appears to be acquired mainly from surface-level co-occurrence patterns.

03

Pre-training does not significantly induce systematic inference.

Abstract

Transformer models pre-trained with a masked-language-modeling objective (e.g., BERT) encode commonsense knowledge as evidenced by behavioral probes; however, the extent to which this knowledge is acquired by systematic inference over the semantics of the pre-training corpora is an open question. To answer this question, we selectively inject verbalized knowledge into the minibatches of a BERT model during pre-training and evaluate how well the model generalizes to supported inferences. We find generalization does not improve over the course of pre-training, suggesting that commonsense knowledge is acquired from surface-level, co-occurrence patterns rather than induced, systematic reasoning.
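The injection-and-probe setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `inject_into_minibatch` and `make_cloze_probe`, the replace-at-random-positions strategy, and the example sentences are all assumptions chosen for illustration. The sketch only shows the data side, placing a verbalized fact into a pre-training minibatch and building a masked probe for an inference that fact would support; scoring the probe with an actual BERT checkpoint is left out.

```python
import random
import re
from typing import List


def inject_into_minibatch(batch: List[str], injected: List[str],
                          rng: random.Random) -> List[str]:
    """Return a copy of `batch` with some sentences replaced by injected ones.

    Assumption: injected sentences overwrite randomly chosen positions, so
    the minibatch size stays constant. The paper may do this differently.
    """
    out = list(batch)
    k = min(len(injected), len(out))
    positions = rng.sample(range(len(out)), k)
    for pos, sentence in zip(positions, injected):
        out[pos] = sentence
    return out


def make_cloze_probe(sentence: str, target: str,
                     mask_token: str = "[MASK]") -> str:
    """Mask the first whole-word occurrence of `target` in `sentence`."""
    pattern = rf"\b{re.escape(target)}\b"
    probe, n_replaced = re.subn(pattern, mask_token, sentence, count=1)
    if n_replaced == 0:
        raise ValueError(f"{target!r} not found in {sentence!r}")
    return probe


# Hypothetical example: one injected premise and one inference it supports.
premise = "A robin is a bird."
supported_inference = "A robin can fly."

corpus_batch = [
    "The market opened higher today.",
    "She parked the car near the station.",
    "Rain is expected over the weekend.",
    "The committee approved the budget.",
]

rng = random.Random(0)
train_batch = inject_into_minibatch(corpus_batch, [premise], rng)
probe = make_cloze_probe(supported_inference, "fly")
```

In a full experiment under these assumptions, `train_batch` would be fed to the masked-language-modeling update at a chosen step, and the model's probability of the masked word in `probe` would be compared before and after that update, repeated at different checkpoints over the course of pre-training.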

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Language and cultural evolution

Methods: Attention Is All You Need · Linear Layer · Dense Connections · Adam · Multi-Head Attention · Linear Warmup With Linear Decay · Attention Dropout · Residual Connection · Layer Normalization