[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Leshem Choshen; Ryan Cotterell; Michael Y. Hu; Tal Linzen; Aaron Mueller; Candace Ross; Alex Warstadt; Ethan Wilcox; Adina Williams; Chengxu Zhuang

arXiv:2404.06214·cs.CL·July 30, 2024·2 cites

TL;DR

The 2nd BabyLM Challenge promotes sample-efficient pretraining on developmentally plausible corpora. This year it replaces the loose track with a paper track, lets participants construct their own datasets within the 100M-word or 10M-word budget, and adds a multimodal vision-and-language track.

Contribution

This paper announces the second BabyLM Challenge with updated rules, a new paper track, and multimodal data, encouraging diverse and cognitively-inspired approaches to language model pretraining.

Findings

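01 A paper track replaces the loose track and accepts non-model-based submissions, novel cognitively-inspired benchmarks, and analysis techniques

02 Relaxed pretraining-data rules: participants may construct their own datasets within the 100M-word or 10M-word budget

03 A multimodal vision-and-language track, with a released corpus of 50% text-only and 50% image-text data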

Abstract

After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM model training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule…
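The word budget described in the abstract can be illustrated with a small check. This is a minimal sketch, not the challenge's official counting procedure: it assumes a "word" is a whitespace-delimited token, and the names `WORD_BUDGETS`, `count_words`, and `within_budget` are hypothetical helpers introduced here for illustration only.

```python
from typing import Iterable

# Budgets named in the abstract; the dictionary keys are hypothetical labels.
WORD_BUDGETS = {"10M": 10_000_000, "100M": 100_000_000}


def count_words(lines: Iterable[str]) -> int:
    """Count whitespace-delimited tokens (an assumed definition of 'word')."""
    return sum(len(line.split()) for line in lines)


def within_budget(lines: Iterable[str], budget: str) -> bool:
    """Return True if the corpus does not exceed the chosen word budget."""
    return count_words(lines) <= WORD_BUDGETS[budget]


if __name__ == "__main__":
    sample = ["the cat sat on the mat", "where is the ball ?"]
    print(count_words(sample), within_budget(sample, "10M"))
```

For a custom dataset stored on disk, the same functions can be fed an open file handle, since iterating over a text file yields one line at a time. How the organizers actually count words may differ from this whitespace rule.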

Code & Models

Repositories

babylm/evaluation-pipeline-2024
pytorch

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling