Score Combination for Improved Parallel Corpus Filtering for Low Resource Conditions

Muhammad N. ElNokrashy; Amr Hendy; Mohamed Abdelghaffar; Mohamed Afify; Ahmed Tawfik; Hany Hassan Awadalla

arXiv:2011.07933 · cs.CL · November 17, 2020

TL;DR

This paper presents a method for filtering parallel corpora in low-resource language conditions by combining multiple scoring techniques, resulting in improved translation quality as measured by sacreBLEU scores.

Contribution

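The authors combine three scores to filter parallel corpora in low-resource settings: a custom LASER model built for each source language, a classifier trained to distinguish positive from negative sentence pairs by semantic alignment, and the original scores included in the task devkit.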

Findings

01

7% and 5% relative sacreBLEU improvements over the baseline on the test set for Pashto and Khmer, respectively

02

Effective combination of multiple scoring methods for corpus filtering

03

Demonstrated benefits in low-resource machine translation tasks

Abstract

This paper describes our submission to the WMT20 sentence filtering task. We combine scores from (1) a custom LASER built for each source language, (2) a classifier built to distinguish positive and negative pairs by semantic alignment, and (3) the original scores included in the task devkit. For the mBART finetuning setup, provided by the organizers, our method shows 7% and 5% relative improvement over baseline, in sacreBLEU score on the test set for Pashto and Khmer respectively.
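The abstract names the three score sources but does not say how they are combined. As a rough illustration only, the sketch below assumes each source yields one score per sentence pair, rescales each source to [0, 1], and takes a weighted average before ranking. The function names, the min-max normalization, the weighted average, and the toy numbers are all assumptions made for this example, not the authors' method.

```python
from typing import Dict, List, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(
    score_sets: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> List[float]:
    """Weighted average of per-source normalized scores (assumed scheme)."""
    names = list(score_sets)
    n = len(score_sets[names[0]])
    if any(len(score_sets[name]) != n for name in names):
        raise ValueError("every score source must cover the same sentence pairs")
    normalized = {name: min_max_normalize(score_sets[name]) for name in names}
    total_weight = sum(weights[name] for name in names)
    return [
        sum(weights[name] * normalized[name][i] for name in names) / total_weight
        for i in range(n)
    ]


def rank_pairs(combined: Sequence[float]) -> List[int]:
    """Indices of sentence pairs, best combined score first."""
    return sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)


# Toy, made-up scores for three sentence pairs (hypothetical values).
toy_scores = {
    "laser": [0.2, 0.9, 0.5],
    "classifier": [0.1, 0.8, 0.9],
    "devkit": [1.0, 3.0, 2.0],
}
toy_weights = {"laser": 1.0, "classifier": 1.0, "devkit": 1.0}
toy_combined = combine_scores(toy_scores, toy_weights)
toy_ranking = rank_pairs(toy_combined)
```

In a filtering setting, the ranking would then be cut off at whatever size budget the task prescribes. Normalizing first keeps a source with a wider numeric range (here the hypothetical devkit scores) from dominating the average.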

Taxonomy

Methods: mBART