[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus
Leshem Choshen, Ryan Cotterell, Michael Y. Hu, Tal Linzen, Aaron Mueller, Candace Ross, Alex Warstadt, Ethan Wilcox, Adina Williams, Chengxu Zhuang

TL;DR
The 2nd BabyLM Challenge aims to promote sample-efficient pretraining on developmentally plausible corpora, introducing new rules, a paper track, and multimodal data to advance language model research.
Contribution
This paper announces the second BabyLM Challenge with updated rules, a new paper track, and multimodal data, encouraging diverse and cognitively-inspired approaches to language model pretraining.
Findings
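A paper track replaces the loose track and accepts non-model-based submissions, cognitively-inspired benchmarks, and analysis techniques
Relaxed pretraining-data rules: participants may construct their own datasets within the 100M-word or 10M-word budget
A multimodal vision-and-language track, with a released corpus of 50% text-only and 50% image-text data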
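The word-budget rule can be illustrated with a minimal sketch that checks whether a self-constructed corpus stays within budget. It assumes that a "word" is a whitespace-delimited token, which may not match the organizers' official counting method. The names WORD_BUDGETS, count_words, and within_budget are hypothetical and are not part of any challenge tooling.

```python
from typing import Iterable

# Budgets named in the call; the string keys are hypothetical labels.
WORD_BUDGETS = {"100M": 100_000_000, "10M": 10_000_000}


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return sum(len(line.split()) for line in lines)


def within_budget(lines: Iterable[str], budget: str) -> bool:
    """Return True if the corpus does not exceed the named word budget."""
    return count_words(lines) <= WORD_BUDGETS[budget]


# Toy corpus for illustration only.
corpus = ["the dog chased the ball", "where did it go ?"]
print(count_words(corpus), within_budget(corpus, "10M"))
```

For a real corpus, the lines could be streamed from an open file handle instead of a list, so the whole dataset never has to fit in memory.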
Abstract
After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM model training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
