ILID: Native Script Language Identification for Indian Languages

Yash Ingle; Pruthwik Mishra

arXiv:2507.11832 · cs.CL · January 8, 2026

TL;DR

This paper introduces ILID, a comprehensive dataset and baseline models for identifying 23 Indian languages in noisy, short, and code-mixed texts, addressing a challenging NLP task with diverse linguistic features.

Contribution

The paper creates a large, annotated dataset of 250K sentences for 23 Indian languages and develops baseline models that outperform existing transformer-based approaches.

Findings

01 Baseline models outperform previous state-of-the-art transformers.

02 New dataset with 250K sentences for 23 Indian languages.

03 Models perform well in noisy and code-mixed environments.

Abstract

Language identification is a fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in the case of the diverse Indian languages, which exhibit lexical and phonetic similarities yet differ in distinct ways. Many Indian languages share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages, namely English and all 22 official Indian languages, each labeled with its language identifier; data for most languages are newly created. We also develop and release…
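The point about shared scripts can be illustrated with a minimal sketch. It is not the authors' method: it only shows why detecting the script is not enough to identify the language. The helper `dominant_script` and the sample sentences are hypothetical illustrations (the sentences are not drawn from the ILID dataset), and the script label is simply the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common Unicode script prefix among letters, or None.

    Hypothetical helper for illustration: the "script" is approximated by the
    first word of each letter's Unicode name (e.g. DEVANAGARI, LATIN).
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, and combining marks
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


# Illustrative sentences (assumed examples, roughly "I am going home").
samples = {
    "Hindi": "मैं घर जा रहा हूँ",
    "Marathi": "मी घरी जात आहे",
    "English": "I am going home",
}

for language, sentence in samples.items():
    print(language, dominant_script(sentence))
```

Both the Hindi and the Marathi sentence are reported as DEVANAGARI, so a script check alone cannot separate them; a classifier has to rely on lexical and sub-word cues instead.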

Code & Models

Models

pruthwik/ilid-muril-model
model · 7 dl

Datasets

yash-ingle/ILID_Indian_Language_Identification_Dataset
dataset · 81 dl

Taxonomy

Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Translation Studies and Practices