MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Trevor Cohn, Meng Fang

TL;DR
MuRating is a scalable framework that transfers English data quality signals to multiple languages, improving multilingual large language model pretraining by selecting high-quality data across 17 languages.
Contribution
Introduces MuRating, a novel multilingual data selection method that leverages English quality signals and translation to enhance LLM pretraining across diverse languages.
Findings
Boosts accuracy on English benchmarks and multilingual evaluations.
Achieves significant gains on knowledge-intensive tasks.
Effectively balances multilingual data selection for pretraining.
Abstract
Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English "raters" via pairwise comparisons to learn unified document-quality scores,then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2 B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsNatural Language Processing Techniques · Topic Modeling · Computational and Text Analysis Methods
