A Diversity Diet for a Healthier Model: A Case Study of French ModernBERT
Louis Estève, Christophe Servan, Thomas Lavergne, Agata Savary

TL;DR
This study investigates how diversity-focused data sampling during pre-training can reduce dataset size and training time for ModernBERT while maintaining or improving performance across NLP tasks.
Contribution
It introduces diversity-driven sampling algorithms for pre-training data selection, demonstrating significant efficiency gains and performance improvements over random sampling.
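As a rough illustration of what diversity-driven sampling can look like, the sketch below greedily selects documents that add the most previously unseen word types per token, under a fixed token budget. This is not the paper's algorithm: the diversity measure (vocabulary richness), the whitespace tokenization, and the names `greedy_diverse_sample` and `token_budget` are all assumptions made for illustration.

```python
def greedy_diverse_sample(docs, token_budget):
    """Return indices of documents chosen greedily for vocabulary diversity.

    Hypothetical sketch: diversity is approximated by the number of new
    word types a document contributes, normalized by its length.
    """
    seen_types = set()          # word types covered by the selection so far
    chosen = []                 # indices of selected documents, in pick order
    used_tokens = 0
    remaining = [(i, d.split()) for i, d in enumerate(docs)]

    while remaining:
        best, best_gain = None, 0.0
        for i, tokens in remaining:
            # Skip empty documents and those that would exceed the budget.
            if not tokens or used_tokens + len(tokens) > token_budget:
                continue
            gain = len(set(tokens) - seen_types) / len(tokens)
            if gain > best_gain:
                best, best_gain = (i, tokens), gain
        # Stop when nothing fits or no candidate adds a new word type.
        if best is None:
            break
        i, tokens = best
        chosen.append(i)
        seen_types.update(tokens)
        used_tokens += len(tokens)
        remaining = [(j, t) for j, t in remaining if j != i]

    return chosen


docs = ["a b c", "a b c", "d e f", "a a a"]
selected = greedy_diverse_sample(docs, token_budget=6)
```

With this toy input, the duplicate and the repetitive document are passed over in favor of the two documents with disjoint vocabularies. The quadratic scan is only practical for small collections; a real pre-training corpus would need a more scalable selection scheme.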
Findings
On some tasks, diversity-driven sampling gains 10 points over randomly sampled pre-training data of comparable size.
A diversity-driven dataset of 150M tokens can match the performance of a randomly sampled dataset of 2.4B tokens.
Pre-training on the diverse dataset took 483h, compared with 1,775h for the larger random dataset.
Abstract
Diversity has been gaining interest in the NLP community in recent years. At the same time, state-of-the-art transformer models such as ModernBERT use very large pre-training datasets, which are driven by size rather than by diversity. This calls for an investigation of the impact of diversity on ModernBERT pre-training. We carry out this investigation with the express intent of reducing pre-training dataset size while retaining at least comparable performance. We compare diversity-driven sampling algorithms in order to pick the best one. We find that, on some tasks, diversity-driven sampling gains 10 points relative to randomly sampled pre-training data of commensurate size. We also see that a model pre-trained for 483h on a diversity-driven dataset of 150M tokens can yield performance commensurate with that of a model pre-trained for 1,775h on a randomly sampled dataset of 2.4B tokens.
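To put the reported figures in proportion, the short calculation below derives the data and time reduction factors from the numbers in the abstract. Only the four input values come from the source; the variable names and the derived ratios are illustrative.

```python
# Figures reported in the abstract.
diverse_hours, random_hours = 483, 1775
diverse_tokens, random_tokens = 150_000_000, 2_400_000_000

# How many times smaller and faster the diversity-driven setup is.
data_ratio = random_tokens / diverse_tokens   # 16.0
time_ratio = random_hours / diverse_hours     # about 3.67

# Equivalent relative savings, as fractions of the random baseline.
data_saving = 1 - diverse_tokens / random_tokens   # 0.9375
time_saving = 1 - diverse_hours / random_hours     # about 0.73

print(f"{data_ratio:.0f}x less data, {time_ratio:.2f}x less training time")
```

The diversity-driven dataset is thus 16 times smaller, while training time shrinks by a factor of roughly 3.7, so the time saving is considerably smaller than the data saving.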
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
