Toward Cross-Lingual Quality Classifiers for Multilingual Pretraining Data Selection
Yassine Turki, Vinko Sabolčec, Bettina Messmer, Martin Jaggi

TL;DR
This paper explores cross-lingual quality classifiers for multilingual data filtering in LLM training, showing that multilingual pooling can improve quality assessment across languages, especially when combined with targeted tuning.
Contribution
It demonstrates that cross-lingual embedding-based quality markers can enable effective multilingual data filtering, reducing reliance on language-specific high-quality data.
Findings
Massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy.
Refining decision boundaries with third quartile sampling (Q3) or retention rate tuning enhances filtering for high-resource languages.
Scale alone does not ensure stability; targeted tuning is necessary.
Abstract
As Large Language Models (LLMs) scale, data curation has shifted from maximizing volume to optimizing the signal-to-noise ratio through quality filtering. However, for many languages, native high-quality data is insufficient to train robust quality classifiers. This work investigates the idea that quality markers in embedding space may show cross-lingual consistency, which would allow high-resource languages to subsidize the filtering of low-resource ones. We evaluate various filtering strategies, including cross-lingual transfer, third quartile sampling (Q3), and retention rate tuning. Our results demonstrate that massive multilingual pooling frequently outperforms monolingual baselines in both rank stability and aggregate accuracy for a 1B model trained on 103B tokens, delivering gains for high-resource languages (a 1.2% increase in aggregate normalized accuracy for French) and…
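As a rough illustration of what a retention rate means in score-based filtering, the sketch below keeps the top fraction of documents ranked by a quality score. It is only one plausible reading of the term and not the authors' implementation: the function names, the toy scores, and the choice to keep ties at the cutoff are all assumptions, and in practice the scores would come from a quality classifier over document embeddings.

```python
import math


def retention_threshold(scores, retention_rate):
    """Return the score cutoff that keeps about the top `retention_rate` fraction.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    if not 0.0 < retention_rate <= 1.0:
        raise ValueError("retention_rate must be in (0, 1]")
    ranked = sorted(scores, reverse=True)
    # Number of documents to keep, with at least one retained.
    keep = max(1, math.ceil(retention_rate * len(ranked)))
    return ranked[keep - 1]


def filter_by_retention(docs, scores, retention_rate):
    """Keep documents whose score is at or above the cutoff.

    Ties at the cutoff are all kept, so slightly more than the
    requested fraction may survive.
    """
    cutoff = retention_threshold(scores, retention_rate)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]


# Toy example: four documents with made-up quality scores.
docs = ["a", "b", "c", "d"]
scores = [0.1, 0.9, 0.5, 0.7]
kept = filter_by_retention(docs, scores, retention_rate=0.5)
```

Tuning would then amount to choosing the rate, for example separately per language, and comparing downstream model accuracy.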
