Baby Scale: Investigating Models Trained on Individual Children's Language Input

Steven Y. Feng; Alvin W.M. Tan; Michael C. Frank

arXiv:2603.29522 · cs.CL · April 1, 2026

TL;DR

This study trains language models on transcripts from the BabyView dataset (videos of children ages 6-36 months) and examines how they scale at child-sized data regimes, how performance varies across individual children's datasets, and how model outcomes relate to children's own language learning.

Contribution

The paper analyzes language models trained on individual children's language input, measuring how model performance varies across children's datasets and which linguistic properties of a dataset predict its quality as training data.

Findings

01. Models trained on child data show acceptable scaling on grammar tasks.

02. On semantic and world knowledge tasks, models trained on child data scale less well than models trained on synthetic data.

03. Word likelihoods in models correlate with children's word learning.
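The third finding relates a model-side quantity (how likely the model finds a word) to a child-side quantity (when or whether children learn it). The sketch below shows one way such a comparison could be set up; it is not the authors' method. It assumes each word has a list of per-occurrence log-probabilities from a model and an age-of-acquisition value in months, and it uses a Spearman rank correlation. The helper names (`mean_surprisal`, `ranks`, `spearman`) and the toy numbers are hypothetical placeholders, not data or results from the paper.

```python
import math


def mean_surprisal(log_probs):
    """Average negative log-likelihood of a word over its occurrences."""
    return -sum(log_probs) / len(log_probs)


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder values for illustration only (NOT from the paper).
word_log_probs = {
    "ball": [-1.0, -1.2],
    "dog": [-2.0, -2.5],
    "because": [-5.0, -4.5],
}
age_of_acquisition_months = {"ball": 14, "dog": 16, "because": 30}

words = sorted(word_log_probs)
surprisals = [mean_surprisal(word_log_probs[w]) for w in words]
ages = [age_of_acquisition_months[w] for w in words]
rho = spearman(surprisals, ages)
print(f"Spearman rho between surprisal and age of acquisition: {rho:.3f}")
```

A rank correlation is used here because it only assumes a monotonic relationship between the two measures; the statistic and the child-side measure actually used in the paper are not stated in this summary.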

Abstract

Modern language models (LMs) must be trained on many orders of magnitude more words of training data than human children receive before they begin to produce useful behavior. Assessing the nature and origins of this "data gap" requires benchmarking LMs on human-scale datasets to understand how linguistic knowledge emerges from children's natural training data. Using transcripts from the BabyView dataset (videos from children ages 6-36 months), we investigate (1) scaling performance at child-scale data regimes, (2) variability in model performance across datasets from different children's experiences and linguistic predictors of dataset quality, and (3) relationships between model and child language learning outcomes. LMs trained on child data show acceptable scaling for grammar tasks, but lower scaling on semantic and world knowledge tasks than models trained on synthetic data; we also…
