On the importance of pre-training data volume for compact language models
Vincent Micheli, Martin d'Hoffschmidt, François Fleuret

TL;DR
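This paper investigates how the volume of pre-training data affects the performance of compact language models. It shows that as little as 100 MB of French text is enough to obtain well-performing models, and that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus brings no substantial improvement.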
Contribution
It provides empirical evidence on the minimal data requirements for effective pre-training of compact language models and evaluates the impact of intermediate pre-training steps.
Findings
Well-performing models can be trained with as little as 100 MB of text.
Past critically low amounts of pre-training data, intermediate pre-training on the task-specific corpus does not yield substantial improvements.
Pre-training data volume critically influences model effectiveness.
Abstract
Recent advances in language modeling have led to computationally intensive and resource-demanding state-of-the-art models. In an effort towards sustainable practices, we study the impact of pre-training data volume on compact language models. Multiple BERT-based models are trained on gradually increasing amounts of French text. Through fine-tuning on the French Question Answering Dataset (FQuAD), we observe that well-performing models are obtained with as little as 100 MB of text. In addition, we show that past critically low amounts of pre-training data, an intermediate pre-training step on the task-specific corpus does not yield substantial improvements.
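The setup of training on gradually increasing amounts of text can be illustrated with a minimal sketch. This is not the authors' code: the function name `nested_subsets`, the byte budgets, and the choice to build each subset as a prefix of the next are assumptions made for illustration only.

```python
from bisect import bisect_right
from itertools import accumulate


def nested_subsets(lines, budgets_bytes):
    """Split a corpus into nested subsets of increasing size.

    Hypothetical helper: for each byte budget, keep the longest prefix of
    `lines` whose total UTF-8 size does not exceed that budget. Smaller
    subsets are therefore always contained in larger ones.
    """
    # Size of every line in bytes, then running totals over the corpus.
    sizes = [len(line.encode("utf-8")) for line in lines]
    cumulative = list(accumulate(sizes))

    subsets = {}
    for budget in sorted(budgets_bytes):
        # Number of leading lines whose cumulative size fits in the budget.
        n_lines = bisect_right(cumulative, budget)
        subsets[budget] = lines[:n_lines]
    return subsets


if __name__ == "__main__":
    # Toy corpus; a real run would stream lines from a large text file.
    corpus = ["Bonjour le monde.\n", "Ceci est un test.\n", "Fin.\n"]
    for budget, subset in nested_subsets(corpus, [20, 40, 60]).items():
        print(budget, len(subset))
```

Each subset would then be used to pre-train one model, which is afterwards fine-tuned and evaluated on the same downstream task so that only the data volume varies between runs.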
