Linguistic Resources for Bhojpuri, Magahi and Maithili: Statistics about them, their Similarity Estimates, and Baselines for Three Applications
Rajesh Kumar Mundotiya, Manish Kumar Singh, Rahul Kapur, Swasti Mishra, Anil Kumar Singh

TL;DR
This paper compiles and analyzes linguistic resources for Bhojpuri, Magahi, and Maithili, comparing their statistical and linguistic properties with Hindi, and establishing baselines for NLP tasks in these low-resource languages.
Contribution
It provides the first comprehensive annotated linguistic corpora and corpus statistics for Bhojpuri, Magahi, and Maithili, along with a comparison to Hindi and baseline results for NLP applications.
Findings
Corpora exhibit diverse morphological, lexical, phonological, and syntactic properties.
POS tagging and chunking datasets are established for each language.
Comparative analysis reveals linguistic similarities and differences among the languages.
Abstract
Corpus preparation for low-resource languages and for development of human language technology to analyze or computationally process them is a laborious task, primarily due to the unavailability of expert linguists who are native speakers of these languages and also due to the time and resources required. Bhojpuri, Magahi, and Maithili, languages of the Purvanchal region of India (in the north-eastern parts), are low-resource languages belonging to the Indo-Aryan (or Indic) family. They are closely related to Hindi, which is a relatively high-resource language, which is why we compare with Hindi. We collected corpora for these three languages from various sources and cleaned them to the extent possible, without changing the data in them. The text belongs to different domains and genres. We calculated some basic statistical measures for these corpora at character, word, syllable, and…
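As a rough illustration of the kind of basic corpus statistics the abstract mentions, the sketch below computes token count, type count, type-token ratio, and unigram entropy at the word and character levels. It is not the authors' procedure: the function names `unit_stats` and `corpus_stats` are hypothetical, words are assumed to be whitespace-separated, and "characters" are taken to be Unicode code points (so Devanagari vowel signs count separately). Syllable-level statistics are left out because they would need a language-specific syllabifier.

```python
import math
from collections import Counter


def unit_stats(units):
    """Token count, type count, type-token ratio, and unigram entropy in bits."""
    counts = Counter(units)
    total = sum(counts.values())
    if total == 0:
        return {"tokens": 0, "types": 0, "ttr": 0.0, "entropy_bits": 0.0}
    # Shannon entropy of the unigram distribution over the given units.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "tokens": total,
        "types": len(counts),
        "ttr": len(counts) / total,
        "entropy_bits": entropy,
    }


def corpus_stats(text):
    """Word- and character-level statistics for a raw text string.

    Assumptions (not from the paper): whitespace tokenization for words,
    and non-whitespace Unicode code points as characters.
    """
    words = text.split()
    chars = [ch for ch in text if not ch.isspace()]
    return {"word": unit_stats(words), "char": unit_stats(chars)}
```

Running `corpus_stats` on the text of each language's corpus and on a Hindi corpus would give comparable figures, which is one simple way to set the three languages side by side with Hindi.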
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
