Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings
David Samuel

TL;DR
This paper investigates latent bootstrapping, a self-supervision method that trains on contextualized embeddings rather than discrete subwords, for pretraining language models in low-resource settings. It is evaluated on the BabyLM shared task.
Contribution
It applies latent bootstrapping, an alternative self-supervision technique that leverages contextualized embeddings, to low-resource language model pretraining.
Findings
Latent bootstrapping enhances linguistic knowledge acquisition in limited data settings.
Pretraining with latent bootstrapping improves performance on linguistic benchmarks.
The approach shows promise for low-resource language modeling.
Abstract
This paper explores the use of latent bootstrapping, an alternative self-supervision technique, for pretraining language models. Unlike the typical practice of using self-supervision on discrete subwords, latent bootstrapping leverages contextualized embeddings for a richer supervision signal. We conduct experiments to assess how effective this approach is for acquiring linguistic knowledge from limited resources. Specifically, our experiments are based on the BabyLM shared task, which includes pretraining on two small curated corpora and an evaluation on four linguistic benchmarks.
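The abstract describes the supervision signal only at a high level, so the following is a minimal sketch of one way such a signal could be set up, not the paper's implementation. It assumes a teacher whose weights are an exponential moving average of the student's weights (suggested by the "mean" in the title) and a mean-squared-error loss between student and teacher embeddings at masked positions. The function names, the decay value, and the toy vectors are all hypothetical.

```python
# Hypothetical sketch: names, loss choice, and hyperparameters are
# illustrative assumptions, not taken from the paper.

def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight toward the matching student weight
    using an exponential moving average."""
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


def latent_loss(student_vecs, teacher_vecs, masked_positions):
    """Mean squared error between student and teacher embeddings,
    computed only at the masked token positions."""
    total, count = 0.0, 0
    for pos in masked_positions:
        for s, t in zip(student_vecs[pos], teacher_vecs[pos]):
            total += (s - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy usage: three token positions with 2-dimensional embeddings.
student_out = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
teacher_out = [[0.0, 1.0], [0.0, 1.0], [2.0, 2.0]]
loss = latent_loss(student_out, teacher_out, masked_positions=[1, 2])

# The teacher drifts slowly toward the student after each step.
teacher_weights = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.9)
```

In contrast to predicting a discrete subword ID, the target here is a continuous vector, which is what the abstract means by a richer supervision signal.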
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Speech and Dialogue Systems
