Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions
Muhammad N. ElNokrashy, Amr Hendy, Mohamed Abdelghaffar, Mohamed Afify, Ahmed Tawfik, Hany Hassan Awadalla

TL;DR
This paper presents a method for filtering parallel corpora in low-resource language conditions by combining multiple scoring techniques, resulting in improved translation quality as measured by sacreBLEU scores.
Contribution
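The authors introduce a score combination approach for corpus filtering in low-resource scenarios. It draws on three sources: a custom LASER built for each source language, a classifier that distinguishes positive from negative pairs by semantic alignment, and the original scores included in the task devkit.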
Findings
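7% and 5% relative sacreBLEU improvement over the baseline on the test set for Pashto and Khmer, respectively, in the mBART finetuning setup provided by the organizers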
Effective combination of multiple scoring methods for corpus filtering
Demonstrated benefits in low-resource machine translation tasks
Abstract
This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7% and 5% relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.
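The abstract names three score sources but does not say how they are merged. The sketch below shows one plausible way to do it, purely as an illustration: min-max normalize each source, take a weighted average, and rank sentence pairs by the result. The function names, the weights, and the normalization step are all assumptions and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so sources on different scales are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # A constant source carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(
    score_sets: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of normalized scores (hypothetical combination rule)."""
    lengths = {len(vals) for vals in score_sets.values()}
    if len(lengths) != 1:
        raise ValueError("every score source must cover the same sentence pairs")
    n = lengths.pop()
    normalized = {name: min_max_normalize(vals) for name, vals in score_sets.items()}
    total_weight = sum(weights[name] for name in normalized)
    return [
        sum(weights[name] * normalized[name][i] for name in normalized) / total_weight
        for i in range(n)
    ]


def rank_pairs(combined: Sequence[float]) -> List[int]:
    """Indices of sentence pairs, best combined score first."""
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Toy scores for three sentence pairs; the values are made up for illustration.
scores = {
    "laser": [0.2, 0.8, 0.5],
    "classifier": [0.1, 0.9, 0.4],
    "devkit": [1.0, 3.0, 2.0],
}
weights = {"laser": 1.0, "classifier": 1.0, "devkit": 1.0}  # assumed equal weights

combined = combine_scores(scores, weights)
ranking = rank_pairs(combined)
```

In a filtering setting, the top-ranked pairs would be kept for training. Any weights would have to be tuned on held-out data; the equal weights here are placeholders.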
Taxonomy
Methods: mBART
