Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi

Ritesh Kumar

arXiv:2111.15322 · cs.CL · December 1, 2021 · 1 citation

Open Access

TL;DR

This paper discusses the challenges of developing language resources for Magahi, a non-scheduled Indo-Aryan language in India, and presents an annotated corpus created from various sources.

Contribution

It presents a POS-annotated corpus for Magahi, a language with virtually no existing language resources, addressing resource scarcity for non-scheduled languages and providing a foundation for future language technology development.

Findings

01. Presents a POS-annotated corpus for Magahi built from blogs, story collections, and conversation recordings

02. Highlights the scarcity of language resources for non-scheduled languages

03. Provides a basis for future NLP research on Magahi

Abstract

Magahi is an Indo-Aryan language spoken mainly in the eastern parts of India. Despite its significant number of speakers, virtually no language resources (LRs) or language technologies (LTs) have been developed for the language, mainly because of its status as a non-scheduled language. The present paper describes an attempt to develop an annotated corpus of Magahi. The data is taken mainly from a couple of Magahi blogs, some collections of Magahi stories, and recordings of Magahi conversations, and it is annotated at the POS level using the BIS tagset.
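
To make the idea of POS-level annotation concrete, the following is a minimal sketch of how a tagged sentence might be read and summarized. The `token/TAG` file layout, the helper names `parse_tagged` and `tag_counts`, and the placeholder tokens are all assumptions made for illustration; they are not taken from the paper, and the actual corpus format may differ. The tag labels shown (N_NN, PSP, V_VM, RD_PUNC) follow BIS-style naming.

```python
from collections import Counter

# Hypothetical example line: placeholder tokens, NOT real Magahi data.
# Assumed layout: whitespace-separated "token/TAG" items with BIS-style tags
# (N_NN = common noun, PSP = postposition, V_VM = main verb, RD_PUNC = punctuation).
SAMPLE = "tok1/N_NN tok2/PSP tok3/N_NN tok4/V_VM ./RD_PUNC"


def parse_tagged(line, sep="/"):
    """Split one annotated line into (token, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the LAST separator so tokens containing the separator survive.
        token, found, tag = item.rpartition(sep)
        if not found:
            raise ValueError(f"item without a tag: {item!r}")
        pairs.append((token, tag))
    return pairs


def tag_counts(sentences):
    """Count how often each tag occurs across a list of parsed sentences."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


pairs = parse_tagged(SAMPLE)
counts = tag_counts([pairs])
```

A tag-frequency table of this kind is a common first sanity check on an annotated corpus, since tags that never occur or occur implausibly often can point to annotation inconsistencies.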

Taxonomy

Topics: Natural Language Processing Techniques · Translation Studies and Practices · Text Readability and Simplification