Mean BERTs make erratic language teachers: the effectiveness of latent bootstrapping in low-resource settings

David Samuel

arXiv:2310.19420 · cs.CL · October 31, 2023 · 2 cites

Open Access · 1 Repo · 1 Dataset

TL;DR

This paper investigates latent bootstrapping, a self-supervision method using contextualized embeddings, to improve pretraining of language models in low-resource scenarios, evaluated through the BabyLM shared task.

Contribution

It introduces latent bootstrapping as a novel self-supervision technique leveraging contextualized embeddings for low-resource language model pretraining.

Findings

01. Latent bootstrapping enhances linguistic knowledge acquisition in limited data settings.

02. Pretraining with latent bootstrapping improves performance on linguistic benchmarks.

03. The approach shows promise for low-resource language modeling.

Abstract

This paper explores the use of latent bootstrapping, an alternative self-supervision technique, for pretraining language models. Unlike the typical practice of using self-supervision on discrete subwords, latent bootstrapping leverages contextualized embeddings for a richer supervision signal. We conduct experiments to assess how effective this approach is for acquiring linguistic knowledge from limited resources. Specifically, our experiments are based on the BabyLM shared task, which includes pretraining on two small curated corpora and an evaluation on four linguistic benchmarks.
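The contrast the abstract draws, between supervision on discrete subwords and supervision on contextualized embeddings, can be illustrated with a small sketch. This is not the paper's implementation. It assumes a mean squared error objective between a student embedding and a teacher embedding, and it assumes the teacher is an exponential moving average of the student (the "mean" in the title hints at this, but the abstract does not state it). The function names `cross_entropy`, `latent_loss`, and `ema_update` and the `decay` parameter are hypothetical and chosen only for illustration.

```python
import math


def cross_entropy(logits, target_index):
    """Discrete-subword supervision: negative log-probability of the correct subword."""
    largest = max(logits)  # subtract the maximum for numerical stability
    log_z = largest + math.log(sum(math.exp(x - largest) for x in logits))
    return log_z - logits[target_index]


def latent_loss(student_vec, teacher_vec):
    """Latent supervision (assumed form): mean squared error between two embeddings."""
    squared = [(s - t) ** 2 for s, t in zip(student_vec, teacher_vec)]
    return sum(squared) / len(squared)


def ema_update(teacher_params, student_params, decay=0.99):
    """Assumed teacher update: move each teacher parameter slightly toward the student."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]
```

In the discrete case the target is a single vocabulary index, while in the latent case the target is a full vector, which is the sense in which the supervision signal can be called richer.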

Code & Models

Repositories

ltgoslo/boot-bert
pytorch · Official

Datasets

SrikrishnaIyer/Babylm-processed-2023
dataset · 15 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech and dialogue systems