ILID: Native Script Language Identification for Indian Languages
Yash Ingle, Pruthwik Mishra

TL;DR
This paper introduces ILID, a dataset and set of baseline models for identifying 23 languages (English and the 22 official Indian languages) in noisy, short, and code-mixed text, a setting made harder by the fact that many of these languages share a script.
Contribution
The paper releases a labeled dataset of 250K sentences covering 23 languages (English and the 22 official Indian languages) and develops baseline models that outperform existing transformer-based approaches.
Findings
Baseline models outperform previous state-of-the-art transformers.
New dataset of 250K sentences covering 23 languages: English and the 22 official Indian languages.
Models perform well in noisy and code-mixed environments.
Abstract
Language identification is a fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in the case of the diverse Indian languages, which exhibit lexical and phonetic similarities yet remain distinct. Many Indian languages share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages, namely English and all 22 official Indian languages, each labeled with its language identifier; the data for most languages are newly created. We also develop and release…
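The abstract's point about shared scripts can be illustrated with a small sketch. This is not the authors' method. It is a hypothetical script detector (the function name `dominant_script` and the sample sentences are assumptions made for illustration) that reads the script from Unicode character names. It separates Bengali-script text from Devanagari text, but it cannot tell Hindi from Marathi, because both are written in Devanagari.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in `text`.

    Hypothetical helper: the script is taken from the first word of each
    character's Unicode name, e.g. "DEVANAGARI LETTER KA" -> "DEVANAGARI".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():  # skips digits, spaces, punctuation, and vowel signs
            counts[unicodedata.name(ch, "UNKNOWN").split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


# Illustrative sentences (assumed examples, not taken from the ILID dataset).
samples = {
    "Hindi": "मेरा नाम यश है",
    "Marathi": "माझे नाव यश आहे",
    "Bengali": "আমার নাম যশ",
    "English": "my name is Yash",
}

for language, sentence in samples.items():
    print(f"{language}: {dominant_script(sentence)}")
```

Hindi and Marathi both map to the same script label here, so a script check alone cannot separate them; telling such languages apart requires lexical or character-level evidence from a trained model.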
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Translation Studies and Practices
