Bringing Up a Bilingual BabyLM: Investigating Multilingual Language Acquisition Using Small-Scale Models

Linda Zeng; Steven Y. Feng; Michael C. Frank

arXiv:2603.29552 · cs.CL · May 8, 2026

TL;DR

This study uses controlled language model training to simulate bilingual language acquisition. It finds that different exposure regimes do not significantly hinder learning and that bilingual input is manageable for statistical learners.

Contribution

The paper introduces a method for simulating bilingual language acquisition with small-scale models, providing insight into how exposure regimes affect learning outcomes.

Findings

01. Bilingual models perform similarly to monolingual models in one language.

02. Bilingual models show strong performance in the second language.

03. Different exposure regimes do not significantly impact learning outcomes.

Abstract

Multilingualism is incredibly common around the world, leading to many important theoretical and practical questions about how children learn multiple languages at once. For example, does multilingual acquisition lead to delays in learning? Are there better and worse ways to structure multilingual input? Many correlational studies address these questions, but it is surprisingly difficult to get definitive answers because children cannot be randomly assigned to be multilingual and data are typically not matched between languages. We use language model training as a method for simulating a variety of highly controlled exposure conditions, and create matched 100M-word mono- and bilingual datasets using synthetic data and machine translation. We train GPT-2 models on monolingual and bilingual data organized to reflect a range of exposure regimes, and evaluate their performance on…
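The abstract describes organizing bilingual data to reflect a range of exposure regimes. A minimal sketch of what such an ordering step could look like is given below. The function `mix_bilingual`, its parameters, and the regime names (`sequential`, `interleaved`, `shuffled`) are illustrative assumptions and are not taken from the paper, which may define its regimes differently.

```python
import random
from typing import List, Sequence


def mix_bilingual(
    lang_a: Sequence[str],
    lang_b: Sequence[str],
    regime: str = "interleaved",
    block_size: int = 1,
    seed: int = 0,
) -> List[str]:
    """Order two corpora into one training stream.

    The regime names are hypothetical labels, not the paper's terminology.
    """
    if regime == "sequential":
        # All of language A first, then all of language B.
        return list(lang_a) + list(lang_b)
    if regime == "interleaved":
        # Alternate fixed-size blocks of documents from each language.
        stream: List[str] = []
        longest = max(len(lang_a), len(lang_b))
        for start in range(0, longest, block_size):
            stream.extend(lang_a[start:start + block_size])
            stream.extend(lang_b[start:start + block_size])
        return stream
    if regime == "shuffled":
        # Mix both languages uniformly at random, reproducibly.
        stream = list(lang_a) + list(lang_b)
        random.Random(seed).shuffle(stream)
        return stream
    raise ValueError(f"unknown regime: {regime!r}")


if __name__ == "__main__":
    docs_a = ["a1", "a2", "a3", "a4"]  # placeholder documents, language A
    docs_b = ["b1", "b2", "b3", "b4"]  # placeholder documents, language B
    print(mix_bilingual(docs_a, docs_b, regime="interleaved", block_size=2))
```

Every regime in this sketch emits the same set of documents and changes only their order, so total exposure per language stays matched across conditions and only the schedule varies.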

Code & Models

Datasets

lindazeng979/bilingual-babyLM
dataset · 73 downloads
