Using English as Pivot to Extract Persian-Italian Parallel Sentences from Non-Parallel Corpora

Ebrahim Ansari; M.H. Sadreddini; Mostafa Sheikhalishahi; Richard Wallace; Fatemeh Alimardani

arXiv:1701.08339·cs.CL·January 31, 2017·2 cites

Open Access

TL;DR

This paper introduces a novel method for extracting Persian-Italian parallel sentences from non-parallel corpora using English as a pivot, enhancing low-resource SMT training data quality.

Contribution

It proposes a new pivot-based extraction approach with a novel NGD similarity metric, improving the quality of bilingual corpora for low-resource language pairs.

Findings

01 Significant increase in bilingual corpus quality

02 Improved SMT performance with the extracted data

03 Effective use of NGD for sentence similarity

Abstract

The effectiveness of a statistical machine translation (SMT) system depends heavily on the amount of parallel data used in the training phase. For low-resource language pairs, there are not enough parallel corpora to build an accurate SMT system. In this paper, a novel approach is presented to extract bilingual Persian-Italian parallel sentences from a non-parallel (comparable) corpus. English is used as the pivot language both to compute the matching scores between source and target sentences and in the candidate selection phase. Additionally, a new monolingual sentence similarity metric, Normalized Google Distance (NGD), is proposed to improve the matching process. Moreover, some extensions of the baseline system are applied to improve the quality of the extracted sentences, measured with BLEU. Experimental results show that using the new pivot-based extraction can increase the quality of…
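The abstract names Normalized Google Distance but does not show how it is computed. As a rough illustration, the sketch below implements the standard term-level NGD definition of Cilibrasi and Vitányi, NGD(x, y) = (max(log f(x), log f(y)) - log f(x, y)) / (log N - min(log f(x), log f(y))). How the paper aggregates term distances into a sentence-level score is not stated here, so that step is left out. The function names `ngd` and `doc_counts` are hypothetical, and counting co-occurrence over a small local document list (instead of search-engine hit counts) is an assumption made only to keep the example self-contained.

```python
import math


def ngd(fx, fy, fxy, n):
    """Term-level Normalized Google Distance from occurrence counts.

    fx, fy: number of documents containing term x / term y
    fxy:    number of documents containing both terms
    n:      total number of documents
    """
    if fx <= 0 or fy <= 0:
        raise ValueError("both terms must occur at least once")
    if fxy <= 0:
        # Terms never co-occur: the distance is unbounded.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denom = math.log(n) - min(log_fx, log_fy)
    if denom == 0:
        # Both terms occur in every document, so they are indistinguishable.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denom


def doc_counts(docs, x, y):
    """Count documents containing x, y, and both (assumption: whitespace tokens)."""
    token_sets = [set(d.lower().split()) for d in docs]
    fx = sum(x in s for s in token_sets)
    fy = sum(y in s for s in token_sets)
    fxy = sum(x in s and y in s for s in token_sets)
    return fx, fy, fxy, len(token_sets)


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    docs = [
        "the cat sat on the mat",
        "a cat and a dog",
        "the dog barked",
        "stock markets fell today",
    ]
    print(ngd(*doc_counts(docs, "cat", "dog")))
```

Lower values indicate terms that tend to appear together; terms that never co-occur get an infinite distance, which a matching pipeline would typically clip or treat as "no evidence".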

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling