Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

Bettina Messmer; Vinko Sabolčec; Martin Jaggi

arXiv:2502.10361·cs.CL·February 20, 2026

TL;DR

This paper introduces a transparent, efficient model-based data filtering framework for multilingual LLM pretraining, improving performance and resource efficiency across diverse languages and scripts.

Contribution

It develops a novel, accessible filtering method for multilingual datasets that enhances LLM training efficiency and effectiveness, especially for low-resource languages.

Findings

01. Matched baseline MMLU scores using only 15% of the training tokens

02. Improved performance across multiple benchmarks

03. Mitigated the curse of multilinguality

Abstract

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we develop a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B…
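
As a rough illustration of the model-based filtering idea described in the abstract, the sketch below scores each document and keeps only the highest-scoring fraction. It is not the paper's method: the scorer is a toy keyword counter standing in for a trained FastText or Transformer classifier, and the names `MARKER_WORDS`, `toy_quality_score`, `select_top_fraction`, and the `keep_fraction` parameter are all hypothetical.

```python
import math
from typing import Callable, List

# Hypothetical marker words for "knowledge-rich" text; purely illustrative,
# a real pipeline would use a trained classifier instead.
MARKER_WORDS = {"theorem", "because", "therefore", "definition", "example"}


def toy_quality_score(text: str) -> float:
    """Toy stand-in for a learned quality classifier.

    Returns the fraction of whitespace-separated tokens that are marker words.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(token.strip(".,;:") in MARKER_WORDS for token in tokens)
    return hits / len(tokens)


def select_top_fraction(
    docs: List[str],
    score_fn: Callable[[str], float],
    keep_fraction: float,
) -> List[str]:
    """Keep the highest-scoring fraction of docs (at least one), in original order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not docs:
        return []
    n_keep = max(1, math.ceil(keep_fraction * len(docs)))
    # Rank document indices by score, best first (ties keep original order).
    ranked = sorted(range(len(docs)), key=lambda i: score_fn(docs[i]), reverse=True)
    kept_indices = sorted(ranked[:n_keep])
    return [docs[i] for i in kept_indices]


if __name__ == "__main__":
    corpus = [
        "buy cheap now click here",
        "A definition: a theorem holds because of the example.",
        "lol",
        "Therefore the result follows.",
    ]
    for doc in select_top_fraction(corpus, toy_quality_score, keep_fraction=0.5):
        print(doc)
```

Passing the scorer as a function keeps the selection step independent of the model, so a real classifier's confidence score could be dropped in without changing the retention logic.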

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification