INDIC-DIALECT: A Multi-Task Benchmark to Evaluate and Translate in Indian Language Dialects
Tarun Sharma, Manikandan Ravikiran, Sourava Kumar Behera, Pramit Bhattacharya, Arnab Bhattacharya, and Rohit Saluja

TL;DR
This paper introduces INDIC-DIALECT, a benchmark dataset for Indian language dialects covering three tasks: dialect classification, multiple-choice question answering, and machine translation. It highlights the challenges of low-resource dialect NLP and evaluates potential solutions.
Contribution
The authors create a human-curated parallel corpus of 13,000 sentence pairs across 11 dialects and 2 languages (Hindi and Odia), and use it to build a multi-task benchmark for dialect classification, translation, and question answering.
Findings
Fine-tuned models significantly outperform LLMs on dialect classification.
Hybrid AI models achieve high BLEU scores in dialect-to-language translation.
A rule-based plus AI approach excels in language-to-dialect translation.
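Since the translation findings are reported in BLEU, a rough illustration of what that metric measures may help. The sketch below is a minimal, unsmoothed, single-reference sentence-level BLEU over whitespace tokens. It is not the paper's evaluation setup: the function names, the tokenization, and the placeholder sentences are all assumptions made for illustration, and the paper's actual BLEU configuration is not stated in this summary.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(reference, hypothesis, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace-tokenized strings.

    Illustrative only: real evaluations usually use corpus-level BLEU
    with a standard tokenizer and smoothing.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: a hypothesis n-gram is credited at most as
        # many times as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision gives BLEU 0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    # Placeholder tokens, not sentences from the INDIC-DIALECT corpus.
    print(sentence_bleu("a b c d e f", "a b c d e"))
```

A hypothesis that matches the reference exactly scores 1.0, and one that shares no 4-gram with it scores 0.0 under this unsmoothed variant, which is why short dialect sentences are often scored with smoothing or at corpus level.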
Abstract
Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets cover standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform…
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Linguistic Variation and Morphology
