PuoBERTa: Training and evaluation of a curated language model for Setswana
Vukosi Marivate, Moseli Mots'Oehli, Valencia Wagner, Richard Lastrucci, and Isheanesu Dzingirai

TL;DR
This paper introduces PuoBERTa, a masked language model for Setswana, trained on a curated monolingual corpus and evaluated on POS tagging, NER, and news categorisation.
Contribution
We developed and evaluated PuoBERTa, a Setswana language model trained on a curated corpus, and introduced a new news categorisation dataset with initial benchmarks to support NLP applications for the language.
Findings
PuoBERTa outperforms baseline models on POS tagging, NER, and news categorisation.
The paper introduces a new Setswana news categorisation dataset with initial benchmarks.
A curated corpus proves effective for low-resource language modelling.
Abstract
Natural language processing (NLP) has made significant progress for well-resourced languages such as English but has lagged behind for low-resource languages like Setswana. This paper addresses this gap by presenting PuoBERTa, a customised masked language model trained specifically for Setswana. We cover how we collected, curated, and prepared diverse monolingual texts to generate a high-quality corpus for PuoBERTa's training. Building upon previous efforts in creating monolingual resources for Setswana, we evaluated PuoBERTa across several NLP tasks, including part-of-speech (POS) tagging, named entity recognition (NER), and news categorisation. Additionally, we introduced a new Setswana news categorisation dataset and provided the initial benchmarks using PuoBERTa. Our work demonstrates the efficacy of PuoBERTa in fostering NLP capabilities for understudied languages like Setswana and…
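Illustrative sketch
The abstract describes PuoBERTa as a masked language model. The snippet below is a minimal sketch of the masking step that such training generally relies on: some tokens are hidden, and the model is trained to recover them. It is not the authors' pipeline. The `<mask>` string, the 15% masking rate, the `mask_tokens` helper, and the placeholder tokens are all assumptions made for illustration; the summary above does not state the settings actually used.

```python
import random

# Assumed placeholder; a real tokenizer defines its own mask token.
MASK_TOKEN = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Build one masked-LM training example (hypothetical helper).

    Returns (masked_tokens, labels). labels[i] holds the original token
    when position i was masked, and None otherwise, so a loss would only
    be computed on the masked positions.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # hide the token from the model
            labels.append(tok)         # remember what it must predict
        else:
            masked.append(tok)         # leave the token visible
            labels.append(None)        # no prediction target here
    return masked, labels


# Placeholder tokens standing in for a tokenized Setswana sentence.
tokens = [f"tok{i}" for i in range(20)]
masked, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
```

Published recipes for this kind of model often vary the replacement (for example, sometimes keeping or randomising the selected token instead of always masking it); the sketch leaves that out to keep the core idea visible.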
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
