Aksharantar: Open Indic-language Transliteration datasets and models for the Next Billion Users
Yash Madhani, Sushane Parthan, Priyanka Bedekar, Gokul NC, Ruchi Khapra, Anoop Kunchukuttan, Pratyush Kumar, Mitesh M. Khapra

TL;DR
This paper introduces Aksharantar, the largest open dataset for Indic language transliteration, along with a multilingual model that significantly improves transliteration accuracy and provides resources to advance research in this area.
Contribution
The paper presents a large-scale, publicly available transliteration dataset for 21 Indic languages and a new multilingual transliteration model, establishing strong baselines and enabling further research.
Findings
The Aksharantar dataset contains 26 million transliteration pairs for 21 languages.
The IndicXlit model improves accuracy by 15% on the Dakshina test set.
The dataset and models are openly available for research and development.
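To make the accuracy figure above concrete, here is a minimal sketch of how a word-level transliteration accuracy could be computed. It assumes "accuracy" means top-1 exact match against a single reference, which this summary does not state. The function name `top1_accuracy` and the toy word lists are hypothetical and are not taken from the paper or its code.

```python
def top1_accuracy(predictions, references):
    """Fraction of words whose top prediction exactly matches the reference.

    Assumption: one reference per word and exact string match; the
    summary does not specify the metric used for the reported gain.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical toy data: native-script references and model outputs.
references = ["नमस्ते", "भारत", "दिल्ली", "किताब"]
predictions = ["नमस्ते", "भारत", "दिली", "किताब"]

accuracy = top1_accuracy(predictions, references)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Whether the reported 15% is an absolute or a relative gain is not stated here, so this sketch only shows the per-model score that such a comparison would be built on.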
Abstract
Transliteration is very important in the Indian language context due to the usage of multiple scripts and the widespread use of romanized inputs. However, few training and evaluation sets are publicly available. We introduce Aksharantar, the largest publicly available transliteration dataset for Indian languages created by mining from monolingual and parallel corpora, as well as collecting data from human annotators. The dataset contains 26 million transliteration pairs for 21 Indic languages from 3 language families using 12 scripts. Aksharantar is 21 times larger than existing datasets and is the first publicly available dataset for 7 languages and 1 language family. We also introduce the Aksharantar testset comprising 103k word pairs spanning 19 languages that enables a fine-grained analysis of transliteration models on native origin words, foreign words, frequent words, and rare…
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices
