Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models
Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza

TL;DR
This study investigates whether training language models on Child-Directed Language improves syntax learning, finding inconsistent benefits compared to Wikipedia data across multiple languages and benchmarks.
Contribution
The paper critically evaluates previous claims about CDL's advantages, introduces the FIT-CLAMS testing methodology, and emphasizes the importance of controlling for frequency effects in syntactic evaluation.
Findings
CDL does not consistently outperform Wikipedia in syntax learning.
Shortcomings in previous benchmarks are identified and addressed with a new testing methodology, FIT-CLAMS.
Frequency control is crucial for fair syntactic ability evaluation.
Abstract
Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach syntactic abilities similar to those of LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS,…
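For readers unfamiliar with minimal-pair evaluation, the following is a minimal sketch of how accuracy on such a benchmark is typically computed: a model is counted as correct on a pair when it scores the grammatical sentence above the ungrammatical one. This is not the authors' code and does not implement FIT-CLAMS, whose details are not given in this excerpt. The function names, the example sentences, and the fixed scores in `toy_score` are all hypothetical placeholders for a real LM's sentence log-probabilities.

```python
from typing import Callable, Iterable, Tuple

Pair = Tuple[str, str]  # (grammatical sentence, ungrammatical sentence)


def minimal_pair_accuracy(pairs: Iterable[Pair],
                          score: Callable[[str], float]) -> float:
    """Fraction of pairs where the grammatical sentence gets the higher score."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one minimal pair")
    correct = sum(1 for good, bad in pairs if score(good) > score(bad))
    return correct / len(pairs)


# Hypothetical scores standing in for LM log-probabilities (assumption:
# a real evaluation would query a trained masked or causal LM instead).
TOY_LOGPROBS = {
    "The author laughs.": -5.0,
    "The author laugh.": -7.0,
    "The pilots smile.": -6.0,
    "The pilots smiles.": -5.5,
}


def toy_score(sentence: str) -> float:
    """Look up the made-up score for a sentence."""
    return TOY_LOGPROBS[sentence]


# Two illustrative subject-verb agreement pairs (invented for this sketch).
PAIRS = [
    ("The author laughs.", "The author laugh."),
    ("The pilots smile.", "The pilots smiles."),
]

accuracy = minimal_pair_accuracy(PAIRS, toy_score)
print(accuracy)
```

With these made-up scores the toy scorer prefers the grammatical sentence in the first pair only, so the accuracy is 0.5. Because a sentence's score also depends on how frequent its words are in the training data, comparisons of this kind can be confounded by lexical frequency, which is the concern behind the call for frequency control above.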
Taxonomy
Topics: Language Development and Disorders · Text Readability and Simplification · Neurobiology of Language and Bilingualism
