PMIndia -- A Collection of Parallel Corpora of Languages of India

Barry Haddow; Faheem Kirefu

arXiv:2001.09907·cs.CL·January 28, 2020·68 cites

TL;DR

This paper introduces PMIndia, a new publicly available parallel corpus of 13 Indian languages with English, supporting multilingual NLP and machine translation, along with an evaluation of alignment methods and initial translation results.

Contribution

The paper presents PMIndia, a large-scale parallel corpus for Indian languages, and compares automatic sentence alignment techniques with initial neural machine translation results.

Findings

01. PMIndia contains up to 56,000 sentences per language pair.

02. Automatic alignment methods vary in accuracy, impacting corpus quality.

03. Initial NMT results demonstrate the corpus's utility for translation tasks.

Abstract

Parallel text is required for building high-quality machine translation (MT) systems, as well as for other multilingual NLP applications. For many South Asian languages, such data is in short supply. In this paper, we describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English. The corpus includes up to 56,000 sentences for each language pair. We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.

Code & Models

Datasets

PMIndiaData/PMIndiaSum
dataset · 83 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression