Automatic Identification of Closely-related Indian Languages: Resources and Experiments
Ritesh Kumar, Bornini Lahiri, Deepak Alok, Atul Kr. Ojha, Mayank Jain, Abdul Basit, Yogesh Dawer

TL;DR
This paper develops a language identification system for five closely related Indian languages, achieving 96.48% accuracy, and studies their lexical similarity using newly compiled corpora.
Contribution
It introduces a new resource of comparable corpora for five Indian languages and presents a state-of-the-art language identification system along with a lexical similarity analysis.
Findings
Achieved 96.48% accuracy in language identification
Compiled comparable corpora for the five languages and documented how they were created
First data-based lexical similarity study of these languages
Abstract
In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.
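The abstract does not say how lexical similarity was measured. As a rough illustration only, one common way to compare languages at the lexical level is the Jaccard overlap of their vocabularies. The sketch below assumes whitespace tokenization and uses made-up placeholder tokens (`w1`, `w2`, ...) and hypothetical corpus names (`lang_a`, `lang_b`, `lang_c`). It is not the authors' method or data.

```python
from itertools import combinations


def vocabulary(text):
    """Return the set of lowercased, whitespace-separated tokens in text."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(corpora):
    """Map each unordered pair of corpus names to its vocabulary overlap."""
    vocabs = {name: vocabulary(text) for name, text in corpora.items()}
    return {
        (x, y): jaccard(vocabs[x], vocabs[y])
        for x, y in combinations(sorted(vocabs), 2)
    }


# Placeholder corpora: the tokens are invented, not drawn from the paper.
toy_corpora = {
    "lang_a": "w1 w2 w3 w4",
    "lang_b": "w3 w4 w5 w6",
    "lang_c": "w7 w8",
}

scores = pairwise_similarity(toy_corpora)
for pair, score in scores.items():
    print(pair, round(score, 3))
```

With real corpora of varying length, raw vocabulary overlap is sensitive to corpus size, so a practical comparison would likely restrict each vocabulary to its most frequent words or sample corpora to equal size first.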
Taxonomy
Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Translation Studies and Practices
