Improving the quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics
Aloka Fernando, Nisansa de Silva, Menan Velyuthan, Charitha Rathnayake, Surangika Ranathunga

TL;DR
This paper introduces heuristics to reduce bias in web-mined parallel corpora for low-resource languages, improving data quality and NMT performance by addressing model biases and noise.
Contribution
It proposes debiasing heuristics to enhance the quality of web-mined parallel data, leading to more consistent NMT results across different multilingual models.
Findings
Debiasing heuristics effectively remove noisy sentence pairs.
Curated datasets improve NMT translation quality.
Reduced disparity across multiPLMs in NMT training.
Abstract
Parallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. The NMT models, trained using the curated corpus, lead to producing better results while…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsNatural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
