Are BabyLMs Second Language Learners?
Lukas Edman, Lisa Bylinina, Faeze Ghorbanpour, Alexander Fraser

TL;DR
This paper approaches BabyLM training from a second language learning perspective, using explicit linguistic data such as grammar examples, word definitions, and paraphrases, and finds that paraphrase data improves model performance the most.
Contribution
It introduces a second language learning perspective for BabyLM, utilizing explicit linguistic data and demonstrating the impact of paraphrase data on model performance.
Findings
Explicit word meaning data does not improve performance.
Grammatical information provides small gains.
Paraphrase data is the most impactful data ingredient.
Abstract
This paper describes a linguistically-motivated approach to the 2024 edition of the BabyLM Challenge (Warstadt et al. 2023). Rather than pursuing a first language learning (L1) paradigm, we approach the challenge from a second language (L2) learning perspective. In L2 learning, there is a stronger focus on learning explicit linguistic information, such as grammatical notions, definitions of words or different ways of expressing a meaning. This makes L2 learning potentially more efficient and concise. We approximate this using data from Wiktionary, grammar examples either generated by an LLM or sourced from grammar books, and paraphrase data. We find that explicit information about word meaning (in our case, Wiktionary) does not boost model performance, while grammatical information can give a small improvement. The most impactful data ingredient is sentence paraphrases, with our two…
Taxonomy
Topics: Second Language Learning and Teaching · EFL/ESL Teaching and Learning
