Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora
Ebrahim Ansari, M.H. Sadreddini, Mostafa Sheikhalishahi, Richard Wallace, Fatemeh Alimardani

TL;DR
This paper introduces a novel method for extracting Persian-Italian parallel sentences from non-parallel corpora using English as a pivot, improving the quality of training data for low-resource statistical machine translation (SMT).
Contribution
It proposes a new pivot-based extraction approach that uses Normalized Google Distance (NGD) as a monolingual sentence similarity metric, improving the quality of bilingual corpora for low-resource language pairs.
Findings
Significant increase in bilingual corpus quality
Improved SMT performance with the extracted data
Effective use of NGD for sentence similarity
Abstract
The effectiveness of a statistical machine translation (SMT) system depends heavily on the amount of parallel data used in the training phase. For low-resource language pairs, there are not enough parallel corpora to build an accurate SMT system. This paper presents a novel approach to extracting bilingual Persian-Italian parallel sentences from a non-parallel (comparable) corpus. English is used as the pivot language, both to compute the matching scores between source and target sentences and in the candidate selection phase. Additionally, a new monolingual sentence similarity metric, Normalized Google Distance (NGD), is proposed to improve the matching process. Moreover, some extensions of the baseline system are applied to improve the quality of the extracted sentences, as measured with BLEU. Experimental results show that using the new pivot-based extraction can increase the quality of…
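For readers unfamiliar with NGD, the sketch below computes the standard term-level Normalized Google Distance from occurrence counts. It is only an illustration: the function name `ngd`, its parameters, and the example counts are assumptions, and the abstract does not say how the authors aggregate term-level distances into a sentence-level similarity score, so that step is not shown.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Term-level Normalized Google Distance from raw counts.

    fx, fy : number of documents (or pages) containing term x / term y
    fxy    : number of documents containing both terms
    n      : total number of documents indexed (normalizing factor)

    All names are illustrative; the counts would come from whatever
    corpus or search index is available.
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("each term must occur at least once")
    if fxy <= 0:
        # Terms never co-occur: the distance is unbounded.
        return math.inf

    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        # Both terms occur in every document, so they are indistinguishable.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the formula.
    print(ngd(fx=1000, fy=100, fxy=10, n=100_000))
```

A distance of 0 means the two terms always appear together, and larger values mean they co-occur less often than their individual frequencies would suggest.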
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
