Assessing the Role of Data Quality in Training Bilingual Language Models

Skyler Seto; Maartje ter Hoeve; Maureen de Seyssel; David Grangier

arXiv:2506.12966·cs.CL·June 17, 2025

TL;DR

This paper investigates how data quality affects the performance of bilingual language models, and shows that filtering the training data for higher quality improves results for French, German, and Chinese.

Contribution

The paper proposes a simple data filtering strategy that selects higher-quality bilingual training data using only high-quality English data, addressing the uneven per-language performance of bilingual and multilingual models.

Findings

01 Unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual models.

02 Filtering for high-quality data reduces performance gaps between languages.

03 Applied to French, German, and Chinese, the approach improves monolingual performance by 2-4%.

Abstract

Bilingual and multilingual language models offer a promising path toward scaling NLP systems across diverse languages and users. However, their performance often varies wildly between languages as prior works show that adding more languages can degrade performance for some languages (such as English), while improving others (typically more data constrained languages). In this work, we investigate causes of these inconsistencies by comparing bilingual and monolingual language models. Our analysis reveals that unequal data quality, not just data quantity, is a major driver of performance degradation in bilingual settings. We propose a simple yet effective data filtering strategy to select higher-quality bilingual training data with only high quality English data. Applied to French, German, and Chinese, our approach improves monolingual performance by 2-4% and reduces bilingual model…
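
The abstract does not say how quality is scored or how much data is kept, so the following is only a minimal sketch of score-based filtering in general, not the authors' method. The names `filter_top_fraction`, `toy_quality_score`, and `keep_fraction` are hypothetical, and the toy score (share of alphabetic or whitespace characters) is a stand-in for whatever quality model would be used in practice.

```python
from typing import Callable, Dict, List, Tuple

Doc = Tuple[str, str]  # (language code, text)


def filter_top_fraction(
    docs: List[Doc],
    score: Callable[[str], float],
    keep_fraction: float,
) -> Dict[str, List[str]]:
    """Keep the highest-scoring fraction of documents within each language."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Group texts by language so each language is filtered independently.
    by_lang: Dict[str, List[str]] = {}
    for lang, text in docs:
        by_lang.setdefault(lang, []).append(text)

    kept: Dict[str, List[str]] = {}
    for lang, texts in by_lang.items():
        ranked = sorted(texts, key=score, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))  # keep at least one
        kept[lang] = ranked[:n_keep]
    return kept


def toy_quality_score(text: str) -> float:
    """Hypothetical placeholder score: share of alphabetic or space characters."""
    if not text:
        return 0.0
    clean = sum(ch.isalpha() or ch.isspace() for ch in text)
    return clean / len(text)


docs = [
    ("en", "The model is trained on web text."),
    ("en", "$$$ 1234 !!!! ####"),
    ("fr", "Le modèle est entraîné sur du texte."),
    ("fr", "%%% 0000 ----"),
]
kept = filter_top_fraction(docs, toy_quality_score, keep_fraction=0.5)
```

Filtering per language rather than over the pooled corpus is a design choice made here so that one language's documents cannot crowd out another's; the source does not state whether the paper does the same.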

Taxonomy

Topics: Data Quality and Management · Natural Language Processing Techniques · Topic Modeling