IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages
Jay Gala, Pranjal A. Chitale, Raghavan AK, Varun Gumma, Sumanth Doddapaneni, Aswanth Kumar, Janki Nawale, Anupama Sujatha, Ratish Puduppully, Vivek Raghavan, Pratyush Kumar, Mitesh M. Khapra, Raj Dabre, Anoop Kunchukuttan

TL;DR
This paper introduces IndicTrans2, a machine translation system that supports all 22 scheduled Indian languages, and releases new datasets, benchmarks, and models to improve the quality and accessibility of translation in India.
Contribution
The paper contributes the largest parallel corpus for Indic languages, a new comprehensive benchmark, and an openly released multilingual model that supports all 22 languages.
Findings
IndicTrans2 outperforms existing models on multiple benchmarks.
The Bharat Parallel Corpus Collection is the largest for Indic languages.
Open access release promotes wider adoption and research.
Abstract
India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages, listed in the Constitution of India (referred to as scheduled languages), are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement:…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Test · Focus
