Enhancing Multilingual LLM Pretraining with Model-Based Data Selection
Bettina Messmer, Vinko Sabol\v{c}ec, Martin Jaggi

TL;DR
This paper introduces a transparent, efficient model-based data filtering framework for multilingual LLM pretraining, improving performance and resource efficiency across diverse languages and scripts.
Contribution
It develops a novel, accessible filtering method for multilingual datasets that enhances LLM training efficiency and effectiveness, especially for low-resource languages.
Findings
Matching baseline MMLU scores with only 15% of training tokens
Improved performance across multiple benchmarks
Mitigated the curse of multilinguality
Abstract
Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Text Readability and Simplification
