Automatic Identification of Closely-related Indian Languages: Resources and Experiments

Ritesh Kumar; Bornini Lahiri; Deepak Alok; Atul Kr. Ojha; Mayank Jain; Abdul Basit; Yogesh Dawer

arXiv:1803.09405·cs.CL·March 28, 2018·24 cites

Open Access

TL;DR

This paper develops a language identification system for five closely related Indian languages, achieving 96.48% accuracy, and studies their lexical similarity using newly compiled corpora.

Contribution

It introduces a new resource of comparable corpora for five Indian languages and presents a state-of-the-art language identification system along with a lexical similarity analysis.

Findings

01

Achieved 96.48% accuracy in language identification

02

Compiled corpora for five languages and documented how they were created

03

First data-based lexical similarity study of these languages

Abstract

In this paper, we discuss an attempt to develop an automatic language identification system for 5 closely-related Indo-Aryan languages of India: Awadhi, Bhojpuri, Braj, Hindi and Magahi. We have compiled comparable corpora of varying length for these languages from various resources. We discuss the method of creation of these corpora in detail. Using these corpora, a language identification system was developed, which currently gives state-of-the-art accuracy of 96.48%. We also used these corpora to study the similarity between the 5 languages at the lexical level, which is the first data-based study of the extent of closeness of these languages.
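
The abstract does not say how lexical similarity was measured. As a minimal sketch of one common way to compare languages at the lexical level, the snippet below computes the Jaccard overlap between the vocabularies of each pair of corpora. The choice of Jaccard overlap, the whitespace tokenisation, the function names, and the placeholder tokens are all assumptions made for illustration; none of them come from the paper, and the toy data is not drawn from its corpora.

```python
from itertools import combinations


def vocabulary(sentences):
    """Collect the set of lowercased, whitespace-separated tokens."""
    return {token for sentence in sentences for token in sentence.lower().split()}


def jaccard(a, b):
    """Jaccard overlap |a & b| / |a | b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_lexical_similarity(corpora):
    """Map each unordered language pair to the Jaccard overlap of their vocabularies."""
    vocabs = {lang: vocabulary(sents) for lang, sents in corpora.items()}
    return {
        (x, y): jaccard(vocabs[x], vocabs[y])
        for x, y in combinations(sorted(vocabs), 2)
    }


# Hypothetical placeholder corpora: the tokens are dummies, not real words.
toy_corpora = {
    "lang_a": ["t1 t2 t3", "t3 t4"],
    "lang_b": ["t3 t4 t5"],
    "lang_c": ["t6 t7"],
}

scores = pairwise_lexical_similarity(toy_corpora)
for pair, score in sorted(scores.items()):
    print(pair, round(score, 2))
```

A set-based measure like this ignores word frequency, so a rare shared word counts as much as a frequent one; a frequency-weighted measure would be a natural alternative if that matters.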

Taxonomy

Topics: Authorship Attribution and Profiling · Natural Language Processing Techniques · Translation Studies and Practices