Challenges in Developing LRs for Non-Scheduled Languages: A Case of Magahi
Ritesh Kumar

TL;DR
This paper discusses the challenges of developing language resources for Magahi, a non-scheduled Indo-Aryan language of India, and presents a POS-annotated corpus built from blogs, collected stories, and recorded conversations.
Contribution
It presents an annotated corpus for Magahi, a language for which virtually no language resources existed, addressing resource scarcity for non-scheduled languages and providing a foundation for future language technology development.
Findings
Creates a POS-annotated corpus for Magahi from blogs, collected stories, and recorded conversations
Highlights the scarcity of language resources for non-scheduled languages
Provides a basis for future NLP research in Magahi
Abstract
Magahi is an Indo-Aryan language spoken mainly in the eastern parts of India. Despite its significant number of speakers, virtually no language resources (LRs) or language technologies (LTs) have been developed for the language, mainly because of its status as a non-scheduled language. The present paper describes an attempt to develop an annotated corpus of Magahi. The data is taken mainly from a couple of Magahi blogs, a collection of Magahi stories, and recordings of Magahi conversations, and it is annotated at the POS level using the BIS tagset.
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Text Readability and Simplification
