How much pretraining data do language models need to learn syntax?

Laura Pérez-Mayos, Miguel Ballesteros, Leo Wanner

arXiv:2109.03160·cs.CL·September 10, 2021

TL;DR

This study investigates how the size of pretraining data affects the syntactic knowledge and performance of RoBERTa models, revealing that more data generally improves syntactic understanding but with diminishing returns and higher costs.

Contribution

The paper provides a systematic analysis of the relationship between pretraining data size and syntactic capabilities in transformer-based models, highlighting cost-benefit trade-offs.

Findings

1. More pretraining data leads to increased syntactic knowledge.

2. More pretraining data improves downstream task performance.

3. Diminishing returns are observed beyond certain data thresholds.

Abstract

Transformer-based pretrained language models achieve outstanding results on many well-known NLU benchmarks. However, while pretraining methods are very convenient, they are expensive in terms of time and resources. This calls for a study of the impact of pretraining data size on the knowledge of the models. We explore this impact on the syntactic capabilities of RoBERTa, using models trained on incremental sizes of raw text data. First, we use syntactic structural probes to determine whether models pretrained on more data encode a higher amount of syntactic information. Second, we perform a targeted syntactic evaluation to analyze the impact of pretraining data size on the syntactic generalization performance of the models. Third, we compare the performance of the different models on three downstream applications: part-of-speech tagging, dependency parsing and paraphrase…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Dropout · Softmax · Attention Dropout · Multi-Head Attention · WordPiece · Layer Normalization · Dense Connections