Is Child-Directed Speech Effective Training Data for Language Models?

Steven Y. Feng; Noah D. Goodman; Michael C. Frank

arXiv:2408.03617 · cs.CL · October 10, 2024

TL;DR

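This study tests whether child-directed speech is effective training data for language models by training GPT-2 and RoBERTa on 29M words of it and comparing against other datasets. It finds that local properties of the data (such as discourse ordering) affect model results while global properties (such as developmental ordering) do not, and that children are more data-efficient learners than the models.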

Contribution

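The paper introduces TinyDialogues, a synthetic dataset matched to 29M words of English child-directed speech, and compares models trained on both against models trained on OpenSubtitles, Wikipedia, and a heterogeneous blend of BabyLM challenge datasets. It then analyzes how global and local properties of the data affect model performance.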

Findings

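01. Local properties of the training data, such as discourse ordering, affect model results.

02. Global properties of the training data, such as developmental ordering, do not significantly influence performance.

03. Children become fluent language users from far less data than the models require.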

Abstract

While high-performing language models are typically trained on hundreds of billions of words, human children become fluent language users with a much smaller amount of data. What are the features of the data they receive, and how do these features support language modeling objectives? To investigate this question, we train GPT-2 and RoBERTa models on 29M words of English child-directed speech and a new matched, synthetic dataset (TinyDialogues), comparing to OpenSubtitles, Wikipedia, and a heterogeneous blend of datasets from the BabyLM challenge. We evaluate the syntactic and semantic knowledge of these models using developmentally-inspired evaluations. Through pretraining experiments, we test whether the global developmental ordering or the local discourse ordering of children's training data supports high performance relative to other datasets. The local properties of the data affect…
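
The distinction the abstract draws between global developmental ordering and local discourse ordering can be illustrated with a minimal Python sketch. This is not the authors' code: the `Conversation` type, its `age_months` field, and the functions `order_globally` and `shuffle_locally` are hypothetical names introduced here, and the toy corpus is invented for illustration. It assumes that "global" means the order of whole conversations across the corpus and "local" means the order of utterances within one conversation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Conversation:
    """Hypothetical record: one dialogue plus the age of the child addressed."""
    age_months: int
    utterances: tuple[str, ...]


def order_globally(convs, mode, seed=0):
    """Reorder whole conversations; utterances inside each stay untouched."""
    if mode == "age":  # youngest first, a developmental curriculum
        return sorted(convs, key=lambda c: c.age_months)
    if mode == "reverse":  # oldest first
        return sorted(convs, key=lambda c: c.age_months, reverse=True)
    if mode == "random":  # no developmental ordering at all
        shuffled = list(convs)
        random.Random(seed).shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown mode: {mode!r}")


def shuffle_locally(convs, seed=0):
    """Keep conversation order, but shuffle utterances within each conversation."""
    rng = random.Random(seed)
    result = []
    for conv in convs:
        utterances = list(conv.utterances)
        rng.shuffle(utterances)
        result.append(Conversation(conv.age_months, tuple(utterances)))
    return result


# Invented toy corpus, for illustration only.
corpus = [
    Conversation(60, ("Why is the sky blue?", "Good question.", "Let's find out.")),
    Conversation(24, ("Look, a ball!", "Ball!", "Yes, a red ball.")),
    Conversation(36, ("Where is the dog?", "Outside.", "Shall we go see him?")),
]

for conv in order_globally(corpus, "age"):
    print(conv.age_months, conv.utterances[0])
```

Under these assumptions, comparing models trained on `order_globally(corpus, "age")` against `order_globally(corpus, "random")` would probe global ordering, while comparing the original corpus against `shuffle_locally(corpus)` would probe local ordering.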

Code & Models

Repositories

styfeng/tinydialogues · jax · Official

Videos

Is Child-Directed Speech Effective Training Data for Language Models?

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling

Methods: WordPiece · Linear Warmup With Linear Decay · BERT · RoBERTa · Linear Layer · Attention Dropout · Residual Connection · Multi-Head Attention · Cosine Annealing