Is Child-Directed Speech Effective Training Data for Language Models?
Steven Y. Feng, Noah D. Goodman, Michael C. Frank

TL;DR
This study investigates whether child-directed speech is effective training data for language models. It finds that local properties of the data (discourse ordering) influence performance, that global properties (developmental ordering) do not, and that children remain more data-efficient learners than the models.
Contribution
The paper compares child-directed speech with other datasets as training data for language models and analyzes how properties of the data affect model performance.
Findings
Local data properties affect model results.
Global data properties do not significantly influence performance.
Children learn language more efficiently than language models, from far less data.
Abstract
While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 and RoBERTa models on 29M words of English child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to OpenSubtitles, Wikipedia, and a heterogeneous blend of datasets from the BabyLM challenge. We evaluate the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets. The local properties of the data affect…
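The distinction between global developmental ordering and local discourse ordering can be pictured with a small sketch. This is not the authors' pipeline: the `Conversation` class, the `child_age_months` field, the function names `global_order` and `shuffle_local`, and the specific conditions (age-sorted, reversed, random, within-conversation shuffling) are all hypothetical, chosen only to illustrate what reordering at each level could look like.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Conversation:
    # Hypothetical metadata: age of the child the speech is directed to.
    child_age_months: int
    # Utterances in their original discourse order.
    utterances: List[str]


def global_order(convs: List[Conversation], mode: str, seed: int = 0) -> List[Conversation]:
    """Reorder whole conversations; utterances inside each one stay untouched."""
    if mode == "age":
        # Developmental order: youngest child first.
        return sorted(convs, key=lambda c: c.child_age_months)
    if mode == "reverse":
        # Reverse developmental order: oldest child first.
        return sorted(convs, key=lambda c: c.child_age_months, reverse=True)
    if mode == "random":
        # No developmental structure: conversations in random order.
        shuffled = list(convs)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown mode: {mode}")


def shuffle_local(convs: List[Conversation], seed: int = 0) -> List[Conversation]:
    """Keep the conversation order but scramble utterances within each conversation."""
    rng = random.Random(seed)
    result = []
    for conv in convs:
        utterances = list(conv.utterances)  # copy so the input is not mutated
        rng.shuffle(utterances)
        result.append(Conversation(conv.child_age_months, utterances))
    return result
```

Under these assumptions, a global manipulation changes only which conversation the model sees when, while a local manipulation breaks the turn-by-turn coherence inside each conversation. Training the same model on each variant and comparing evaluation scores would be one way to separate the two effects.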
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: WordPiece · Linear Warmup With Linear Decay · BERT · RoBERTa · Linear Layer · Attention Dropout · Residual Connection · Multi-Head Attention · Cosine Annealing
