Bootstrapping Method for Developing Part-of-Speech Tagged Corpus in Low Resource Languages Tagset - A Focus on an African Igbo
Onyenwe Ikechukwu E, Onyedinma Ebele G, Aniegwu Godwin E, Ezeani Ignatius M

TL;DR
This paper presents a bootstrapping approach combining cross-lingual and monolingual methods to develop a POS tagged corpus for low-resource languages, demonstrated on Igbo, improving annotation accuracy and efficiency.
Contribution
It introduces a novel combined bootstrapping method leveraging parallel texts and NLP resources to create POS tagged corpora for low-resource languages like Igbo.
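As an illustration of the cross-lingual side, one common way to use parallel text is to project tags from a tagged source sentence onto the target sentence through word alignments. The sketch below is not the authors' implementation: the function name `project_tags`, the tag map, the `UNK` fallback, and the placeholder target tokens are all hypothetical, and the word alignment is assumed to be given rather than computed.

```python
from typing import Dict, List, Tuple


def project_tags(
    source_tags: List[str],
    target_tokens: List[str],
    alignment: List[Tuple[int, int]],
    tag_map: Dict[str, str],
    unknown: str = "UNK",
) -> List[Tuple[str, str]]:
    """Project source-side POS tags onto target tokens via word alignment.

    alignment holds (source_index, target_index) pairs. tag_map converts a
    source tagset label into the target tagset (hypothetical mapping).
    Target tokens with no aligned source word keep the `unknown` tag, so a
    human annotator or a monolingual step can fill them in later.
    """
    projected = [unknown] * len(target_tokens)
    for src_idx, tgt_idx in alignment:
        # Keep the first projected tag if several source words align
        # to the same target token.
        if projected[tgt_idx] == unknown:
            projected[tgt_idx] = tag_map.get(source_tags[src_idx], unknown)
    return list(zip(target_tokens, projected))


# Placeholder data: "w1".."w3" stand in for target-language tokens.
example = project_tags(
    source_tags=["DT", "NN", "VBZ"],
    target_tokens=["w1", "w2", "w3"],
    alignment=[(1, 0), (2, 1)],
    tag_map={"NN": "NOUN", "VBZ": "VERB"},
)
print(example)
```

Tokens left as `UNK` mark exactly where projection gave no evidence, which is where manual correction or a monolingual method would be needed.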
Findings
Accuracy of POS tagging improved from 6.13% to 83.79%.
Tags transformation rate increased from 8.67% to 98.37%.
Method effectively accelerates POS corpus development for low-resource languages.
Abstract
Most languages, especially in Africa, have few or no established part-of-speech (POS) tagged corpora. However, a POS tagged corpus is essential for natural language processing (NLP) to support advanced research such as machine translation and speech recognition. Even where no POS tagged corpus exists, parallel texts are available online for some languages. The task of POS tagging a new language corpus with a new tagset usually faces a bootstrapping problem at the initial stages of the annotation process. Without automatic taggers to help the human annotator, quickly producing adequate amounts of POS tagged corpus for advanced NLP research and for training taggers appears infeasible. In this paper, we demonstrate the efficacy of a POS annotation method that employed the services of two automatic approaches to…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems
