Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm

TL;DR
This paper presents JQL, a scalable and efficient method for selecting high-quality multilingual data for language model pretraining. It outperforms heuristic filtering methods and improves cross-lingual transferability.
Contribution
Introduces JQL, a multilingual data filtering approach that distills LLMs' annotation capabilities into lightweight annotators built on pretrained multilingual embeddings, improving data quality and scalability for LLM pretraining.
Findings
JQL outperforms heuristic filtering methods such as those used for FineWeb2.
JQL improves downstream model training quality.
JQL increases data retention rates across 35 languages.
Abstract
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Yet, the availability of suitable open-source multilingual datasets remains limited. Existing state-of-the-art datasets mostly rely on heuristic filtering methods, restricting both their cross-lingual transferability and scalability. Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale while significantly reducing computational demands. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings. These models exhibit robust multilingual and cross-lingual performance, even for languages and scripts unseen during training. Evaluated empirically across 35 languages, the resulting annotation pipeline substantially outperforms current heuristic filtering…
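The distillation idea in the abstract (a small annotator trained on top of frozen embeddings to reproduce an LLM's quality judgments, then used to filter documents) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the embeddings and teacher scores are synthetic random data, the annotator is a closed-form ridge regression head, and the names fit_ridge_head, predict_scores and keep_top_fraction are hypothetical.

```python
import numpy as np


def _with_bias(embeddings):
    """Append a constant column so the linear head can learn an offset."""
    return np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])


def fit_ridge_head(embeddings, teacher_scores, l2=1e-2):
    """Fit a linear head on frozen embeddings with closed-form ridge regression.

    Assumption: a linear head stands in for the paper's lightweight annotator.
    """
    X = _with_bias(embeddings)
    reg = l2 * np.eye(X.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    return np.linalg.solve(X.T @ X + reg, X.T @ teacher_scores)


def predict_scores(weights, embeddings):
    """Score documents using only their embeddings (no LLM call needed)."""
    return _with_bias(embeddings) @ weights


def keep_top_fraction(scores, keep_fraction):
    """Boolean mask selecting roughly the top `keep_fraction` of documents."""
    threshold = np.quantile(scores, 1.0 - keep_fraction)
    return scores >= threshold


# Synthetic stand-ins: random vectors play the role of multilingual document
# embeddings, and a noisy linear function plays the role of LLM quality labels.
rng = np.random.default_rng(0)
dim = 16
true_direction = rng.normal(size=dim)

train_emb = rng.normal(size=(500, dim))
train_scores = train_emb @ true_direction + 0.1 * rng.normal(size=500)
weights = fit_ridge_head(train_emb, train_scores)

# Annotate unseen documents cheaply, then keep the highest-scoring quarter.
new_emb = rng.normal(size=(200, dim))
pred = predict_scores(weights, new_emb)
mask = keep_top_fraction(pred, keep_fraction=0.25)
```

The point of the design is that the expensive LLM is needed only to label the training set; after that, scoring a new document costs one embedding lookup and one dot product. The 25% retention threshold is arbitrary and would in practice be chosen per language.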
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Mobile Crowdsensing and Crowdsourcing
