BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM

Zhewen Shen; Aditya Joshi; Ruey-Cheng Chen

arXiv:2406.11418·cs.CL·July 10, 2024


TL;DR

BAMBINO-LM introduces a human-inspired continual pretraining method for small-scale language models, improving bilingual language capabilities and mimicking human language learning behaviors.

Contribution

The paper proposes a pretraining strategy that combines alternation with PPO-based rewards, and reports improved Italian language skills and human-like learning effects in BabyLM models.

Findings

01. Enhanced Italian language capability in BabyLM models

02. Effectiveness depends on both the alternation and PPO strategies

03. Model exhibits human-like degradation in L1 learning

Abstract

Children from bilingual backgrounds benefit from interactions with parents and teachers to re-acquire their heritage language. In this paper, we investigate how this insight from behavioral study can be incorporated into the learning of small-scale language models. We introduce BAMBINO-LM, a continual pre-training strategy for BabyLM that uses a novel combination of alternation and PPO-based perplexity reward induced from a parent Italian model. Upon evaluation on zero-shot classification tasks for English and Italian, BAMBINO-LM improves the Italian language capability of a BabyLM baseline. Our ablation analysis demonstrates that employing both the alternation strategy and PPO-based modeling is key to this effectiveness gain. We also show that, as a side effect, the proposed method leads to a similar degradation in L1 effectiveness as human children would have had in an equivalent…
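To make the idea of a perplexity-based reward concrete, here is a minimal sketch in Python. It is not the authors' implementation. The abstract does not give the reward formula or say what is alternated, so the sketch makes two assumptions: the reward is the negative log-perplexity of a generated sequence under the parent model, and the alternation switches between plain language-modeling updates and PPO updates. The function names `perplexity`, `perplexity_reward`, and `alternating_phases`, and the `period` parameter, are hypothetical.

```python
import math
from typing import Iterator, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity of one sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    # Perplexity is exp of the mean negative log-likelihood per token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def perplexity_reward(token_logprobs: List[float]) -> float:
    """Assumed reward shape: negative log-perplexity.

    The log-probabilities are taken to come from the parent (Italian) model
    scoring text produced by the learner, so text the parent finds more
    fluent (lower perplexity) receives a higher reward.
    """
    return -math.log(perplexity(token_logprobs))


def alternating_phases(num_steps: int, period: int = 1) -> Iterator[str]:
    """Yield 'lm' or 'ppo' for each step, switching every `period` steps.

    Assumed schedule: start with a language-modeling phase, then alternate.
    """
    for step in range(num_steps):
        yield "lm" if (step // period) % 2 == 0 else "ppo"
```

In a training loop, each step would consult `alternating_phases` to decide whether to run an ordinary next-token update or a PPO update that uses `perplexity_reward` as its scalar reward. Other reward shapes, such as an inverse or clipped perplexity, would fit the same interface.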



Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Topic Modeling