Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Aloka Fernando; Nisansa de Silva; Menan Velyuthan; Charitha Rathnayake; Surangika Ranathunga

arXiv:2502.19074 · cs.CL · September 23, 2025

TL;DR

This paper introduces heuristics to reduce bias in web-mined parallel corpora for low-resource languages, improving data quality and NMT performance by addressing model biases and noise.

Contribution

It proposes debiasing heuristics to enhance the quality of web-mined parallel data, leading to more consistent NMT results across different multilingual models.

Findings

01. Debiasing heuristics effectively remove noisy sentence pairs.

02. Curated datasets improve NMT translation quality.

03. NMT models trained on the curated data show less disparity across multiPLMs.

Abstract

Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. NMT models trained on the curated corpus produce better results while…
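The similarity-based ranking described in the abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the scoring function and uses a toy lookup table in place of a real multiPLM encoder; the names `cosine`, `rank_pairs`, and `toy_vectors` are hypothetical and do not come from the paper, and the paper's debiasing heuristics are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_pairs(pairs, embed_src, embed_tgt, top_k=None):
    """Score each (source, target) pair and return them sorted best-first.

    embed_src / embed_tgt are assumed to map a sentence to a vector;
    in practice they would wrap a multilingual sentence encoder.
    """
    scored = [(cosine(embed_src(s), embed_tgt(t)), s, t) for s, t in pairs]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored if top_k is None else scored[:top_k]


# Toy embeddings standing in for encoder output (illustrative values only).
toy_vectors = {
    "good morning": [1.0, 0.0, 0.0],
    "tgt: good morning": [0.9, 0.1, 0.0],
    "click here": [0.0, 1.0, 0.0],
    "tgt: unrelated text": [0.0, 0.0, 1.0],
}

pairs = [
    ("click here", "tgt: unrelated text"),
    ("good morning", "tgt: good morning"),
]

# Keep only the single highest-scoring pair.
ranked = rank_pairs(pairs, toy_vectors.__getitem__, toy_vectors.__getitem__, top_k=1)
```

Because the ranking depends entirely on the embedding function, swapping in a different encoder can change which pairs survive the `top_k` cut, which is the kind of disparity across multiPLMs that the abstract refers to.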

Taxonomy

Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling