Bhasha-Abhijnaanam: Native-script and romanized Language Identification for 22 Indic languages
Yash Madhani, Mitesh M. Khapra, Anoop Kunchukuttan

TL;DR
This paper introduces Bhasha-Abhijnaanam, a language identification test set, and IndicLID, a language identifier, both covering 22 Indian languages in native and romanized script. It addresses the lack of romanized training data and the difficulty of telling similar languages apart, and it releases its datasets and models publicly.
Contribution
The paper presents the first language identification model for romanized text in Indian languages and provides language identification datasets for both native-script and romanized text.
Findings
On native-script text, IndicLID covers more languages than existing LIDs and is competitive with or better than them.
IndicLID achieves competitive accuracy on romanized text.
Public datasets and models are freely accessible for research.
Abstract
We create publicly available language identification (LID) datasets and models in all 22 Indian languages listed in the Indian constitution in both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text which spans all 22 Indic languages. We also train IndicLID, a language identifier for all the above-mentioned languages in both native and romanized script. For native-script text, it has better language coverage than existing LIDs and is competitive with or better than other LIDs. IndicLID is the first LID for romanized text in Indian languages. Two major challenges for romanized text LID are the lack of training data and low LID performance when languages are similar. We provide simple and effective solutions to these problems. In general, there has been limited work on romanized text in any…
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Translation Studies and Practices
Methods: Test
