PMIndia -- A Collection of Parallel Corpora of Languages of India
Barry Haddow, Faheem Kirefu

TL;DR
This paper introduces PMIndia, a new publicly available parallel corpus of 13 Indian languages with English, supporting multilingual NLP and machine translation, along with an evaluation of alignment methods and initial translation results.
Contribution
The paper presents PMIndia, a large-scale parallel corpus for Indian languages, and compares automatic sentence alignment techniques with initial neural machine translation results.
Findings
PMIndia contains up to 56,000 sentences per language pair.
Automatic alignment methods vary in accuracy, impacting corpus quality.
Initial NMT results demonstrate the corpus's utility for translation tasks.
Abstract
Parallel text is required for building high-quality machine translation (MT) systems, as well as for other multilingual NLP applications. For many South Asian languages, such data is in short supply. In this paper, we describe a new publicly available corpus (PMIndia) consisting of parallel sentences which pair 13 major languages of India with English. The corpus includes up to 56,000 sentences for each language pair. We explain how the corpus was constructed, including an assessment of two different automatic sentence alignment methods, and present some initial NMT results on the corpus.
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
