Revisiting Low Resource Status of Indian Languages in Machine Translation

Jerin Philip; Shashank Siripragada; Vinay P. Namboodiri; C.V. Jawahar

arXiv:2008.04860·cs.CL·November 5, 2020

TL;DR

This paper introduces an automated framework for building larger parallel corpora for Indian language machine translation, and reports improved translation quality on the WAT benchmark.

Contribution

It presents an iterative, automated pipeline for corpus creation that enhances data size and quality for Indian language NMT systems, with analysis of key design choices.

Findings

1. Larger corpus improves translation performance on the WAT benchmark
2. Iterative pipeline effectively increases corpus size and quality
3. Choice of pivot language impacts translation results

Abstract

Indian language machine translation performance is hampered by the lack of large-scale, multilingual, sentence-aligned corpora and robust benchmarks. In this paper, we provide and analyse an automated framework to obtain such a corpus for Indian language neural machine translation (NMT) systems. Our pipeline consists of a baseline NMT system, a retrieval module, and an alignment module used to work with publicly available websites such as government press releases. Our main contribution is an incremental method that uses this pipeline to iteratively grow the corpus and to improve each component of the system. We also evaluate design choices such as the choice of pivot language and the effect of iteratively increasing the corpus size. Our work in addition to providing an…
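The iterative idea in the abstract (translate, align, grow the corpus, retrain, repeat) can be illustrated with a toy sketch. This is not the authors' implementation: the real components are neural, and every name here (`jaccard`, `align`, `train_toy_translator`, `iterate_mining`), the word-lexicon "translator", the token-overlap score, and the 0.5 threshold are assumptions chosen only to make the loop runnable. The retrieval step is omitted; the candidate sentences are assumed to come from already-matched documents.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a real alignment score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def train_toy_translator(corpus: List[Pair]) -> Callable[[str], str]:
    """Hypothetical 'training': learn a word lexicon from equal-length pairs."""
    lexicon: Dict[str, str] = {}
    for src, tgt in corpus:
        s, t = src.split(), tgt.split()
        if len(s) == len(t):
            lexicon.update(zip(s, t))
    # Unknown words are copied through unchanged.
    return lambda sent: " ".join(lexicon.get(w, w) for w in sent.split())


def align(src_sents: List[str], tgt_sents: List[str],
          translate: Callable[[str], str], threshold: float = 0.5) -> List[Pair]:
    """Pair each source sentence with its best-scoring target, if good enough."""
    pairs: List[Pair] = []
    for s in src_sents:
        hyp = translate(s)
        best = max(tgt_sents, key=lambda t: jaccard(hyp, t), default=None)
        if best is not None and jaccard(hyp, best) >= threshold:
            pairs.append((s, best))
    return pairs


def iterate_mining(seed: List[Pair], src_sents: List[str],
                   tgt_sents: List[str], rounds: int = 3) -> List[Pair]:
    """Retrain on the growing corpus and re-mine until nothing new is found."""
    corpus = list(seed)
    for _ in range(rounds):
        translate = train_toy_translator(corpus)
        new = [p for p in align(src_sents, tgt_sents, translate) if p not in corpus]
        if not new:
            break
        corpus.extend(new)
    return corpus


# Made-up toy data: single letters stand in for source-language words.
seed = [("a b", "one two")]
src_sents = ["a b c", "b c"]
tgt_sents = ["one two three", "two three"]
corpus = iterate_mining(seed, src_sents, tgt_sents)
print(len(corpus))  # 3: the seed pair plus two mined pairs
```

In this toy run, the pair ("b c", "two three") scores below the threshold in the first round and is only recovered in the second, after the lexicon has learned a mapping for "c" from the first mined pair. That is the sense in which a larger corpus can improve the components that mine it.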
