On the importance of pre-training data volume for compact language models

Vincent Micheli; Martin d'Hoffschmidt; François Fleuret

arXiv:2010.03813·cs.CL·October 12, 2020

TL;DR

This paper investigates how the volume of pre-training data affects compact language models. It shows that as little as 100 MB of text can yield well-performing models and that, beyond critically low data volumes, intermediate pre-training on the task-specific corpus brings no substantial gains.

Contribution

It provides empirical evidence on the minimal data requirements for effective pre-training of compact language models and evaluates the impact of intermediate pre-training steps.

Findings

01

Well-performing models can be trained with as little as 100 MB of text.

02

Past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.

03

Pre-training data volume matters most at critically low amounts; models fine-tuned on FQuAD already perform well once pre-trained on about 100 MB of French text.

Abstract

Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
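The abstract describes training models on gradually increasing amounts of text. As a rough illustration only, the sketch below builds nested corpus subsets under increasing byte budgets. The function name `nested_subsets`, the prefix-based selection, and the toy budgets are assumptions made for this example; the sampling procedure the authors actually used is not described on this page.

```python
from typing import Dict, List


def nested_subsets(lines: List[str], budgets_bytes: List[int]) -> Dict[int, List[str]]:
    """Hypothetical helper: for each byte budget, keep the longest prefix of
    `lines` whose total UTF-8 size fits within that budget.

    Because every subset is a prefix of the same list, smaller subsets are
    contained in larger ones.
    """
    subsets: Dict[int, List[str]] = {}
    for budget in sorted(budgets_bytes):
        used = 0
        subset: List[str] = []
        for line in lines:
            size = len(line.encode("utf-8"))  # accented characters take more than one byte
            if used + size > budget:
                break
            subset.append(line)
            used += size
        subsets[budget] = subset
    return subsets


if __name__ == "__main__":
    # Toy corpus and toy budgets (assumed values, not the paper's settings).
    corpus = ["Bonjour le monde.", "Le modèle est compact.", "Fin."]
    for budget, subset in nested_subsets(corpus, [20, 50, 1000]).items():
        print(budget, len(subset))
```

Measuring the budget in UTF-8 bytes rather than characters keeps the subset sizes comparable to on-disk figures such as the 100 MB quoted in the abstract.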
