Development of a Dataset and a Deep Learning Baseline Named Entity Recognizer for Three Low Resource Languages: Bhojpuri, Maithili and Magahi
Rajesh Kumar Mundotiya, Shantanu Kumar, Ajeet Kumar, Umesh Chandra Chaudhary, Supriya Chauhan, Swasti Mishra, Praveen Gatla, Anil Kumar Singh

TL;DR
This paper builds a NER dataset for three low-resource languages (Bhojpuri, Maithili, and Magahi) and establishes a deep learning baseline for them using an LSTM-CNNs-CRF model.
Contribution
It introduces the first annotated NER benchmark dataset for three low-resource languages and evaluates a deep learning baseline for NER in these languages.
Findings
The deep learning baseline achieves F1-scores comparable to or better than those of CRF models.
The corpus for each language is annotated with 22 entity labels.
The work provides a foundation for future NLP tasks in low-resource languages.
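Since the findings compare models by F1-score, a minimal sketch of how an entity-level F1 can be computed may help. It assumes BIO-style tags and exact span matching, which this summary does not confirm as the paper's evaluation protocol. The label names `PER` and `LOC` and the functions `bio_to_spans` and `entity_f1` are hypothetical, chosen only for illustration.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity label)
Span = Tuple[int, int, str]


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (assumed scheme)."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        continues_span = tag.startswith("I-") and label == tag[2:]
        if continues_span:
            continue
        if label is not None:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        # A stray I- tag with no matching open span is ignored (a simplification).
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match, span-level F1 between gold and predicted tags."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical five-token sentence: the prediction misses the LOC entity.
gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "O", "O"]
print(round(entity_f1(gold, pred), 3))
```

In this toy case precision is 1.0 and recall is 0.5, so the F1 is 2/3. Scoring whole spans rather than individual tokens means a partially recognized entity earns no credit, which is the stricter and more common convention for NER benchmarks.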
Abstract
In Natural Language Processing (NLP) pipelines, Named Entity Recognition (NER) is one of the preliminary problems, which marks proper nouns and other named entities such as Location, Person, Organization, Disease etc. Such entities, without a NER module, adversely affect the performance of a machine translation system. NER helps in overcoming this problem by recognising and handling such entities separately, although it can be useful in Information Extraction systems also. Bhojpuri, Maithili and Magahi are low resource languages, usually known as Purvanchal languages. This paper focuses on the development of a NER benchmark dataset for the Machine Translation systems developed to translate from these languages to Hindi by annotating parts of their available corpora. Bhojpuri, Maithili and Magahi corpora of sizes 228373, 157468 and 56190 tokens, respectively, were annotated using 22…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Translation Studies and Practices
