Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Michael Y. Hu; Aaron Mueller; Candace Ross; Adina Williams; Tal Linzen; Chengxu Zhuang; Ryan Cotterell; Leshem Choshen; Alex Warstadt; Ethan Gotlieb Wilcox

arXiv:2412.05149 · cs.CL · December 9, 2024 · 3 cites

TL;DR

This paper reports the findings of the second BabyLM Challenge. Across 31 submissions using diverse methods, a hybrid causal-masked architecture performed best at sample-efficient language learning on limited data. The paper also offers insights into the effects of changes to training data, objectives, and architecture.

Contribution

The paper introduces improved text corpora and a new multimodal (vision-and-language) track, and highlights that hybrid causal-masked models, along with data and architecture modifications, improve small-scale language model performance.

Findings

1. Hybrid causal-masked models outperform other approaches.
2. No multimodal submissions beat baseline performance.
3. Training FLOPs strongly correlate with task performance.

Abstract

The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track.…
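The fixed data budget described in the abstract can be illustrated with a minimal sketch that checks whether a corpus fits a track's word limit. The track keys (`text-10M`, `text-100M`) and the helper names are hypothetical, and counting whitespace-delimited tokens is an assumption; the challenge's official counting procedure is not stated here and may differ.

```python
from typing import Iterable

# Word budgets per text-only track, taken from the abstract.
# The dictionary keys are hypothetical labels, not official track names.
TRACK_BUDGETS = {
    "text-10M": 10_000_000,
    "text-100M": 100_000_000,
}


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return sum(len(line.split()) for line in lines)


def within_budget(lines: Iterable[str], track: str) -> bool:
    """Return True if the corpus has no more words than the track allows."""
    return count_words(lines) <= TRACK_BUDGETS[track]


if __name__ == "__main__":
    sample = ["the cat sat", "on the mat"]
    print(count_words(sample))                # number of words in the sample
    print(within_budget(sample, "text-10M"))  # whether it fits the 10M track
```

Because `count_words` accepts any iterable of strings, an open file handle can be passed directly so that a large corpus is streamed line by line rather than loaded into memory.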

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Natural Language Processing Techniques