Pre-training a Transformer-Based Generative Model Using a Small Sepedi Dataset
Simon P. Ramalepe, Thipe I. Modipa, Marelie H. Davel

TL;DR
This paper explores pre-training transformer-based language models for the low-resource Sepedi language using two new datasets, compares occlusion-based and non-occlusion pre-training techniques, and evaluates the resulting models on text generation.
Contribution
It introduces two new Sepedi datasets and systematically compares occlusion and non-occlusion pre-training methods for low-resource language modeling.
Findings
Non-occlusion models achieve lower validation loss and perplexity than occlusion models.
Occlusion models achieve higher BLEU scores, indicating better text quality.
New datasets enable effective pre-training for low-resource language NLP tasks.
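Validation loss and perplexity, the two metrics named in the findings, are closely linked: perplexity is conventionally the exponential of the mean per-token cross-entropy loss (in nats), so a model that is better on one is better on the other. The sketch below illustrates that relationship only. The probability values are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import math


def mean_cross_entropy(token_probs):
    """Mean negative log-likelihood (in nats) of the probabilities
    a model assigned to the correct tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(loss_nats):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    return math.exp(loss_nats)


# Placeholder probabilities for two hypothetical models (not paper data).
probs_model_a = [0.5, 0.25, 0.5, 0.25]
probs_model_b = [0.25, 0.125, 0.25, 0.125]

loss_a = mean_cross_entropy(probs_model_a)
loss_b = mean_cross_entropy(probs_model_b)

# The model with the lower loss also has the lower perplexity.
print(f"model A: loss={loss_a:.3f} perplexity={perplexity(loss_a):.3f}")
print(f"model B: loss={loss_b:.3f} perplexity={perplexity(loss_b):.3f}")
```

Because the exponential is monotonic, ranking models by validation loss and ranking them by perplexity give the same order, provided both are computed over the same tokenization.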
Abstract
Due to the scarcity of data in low-resourced languages, the development of language models for these languages has been very slow. Pre-trained language models have gained popularity in natural language processing, especially in developing domain-specific models for low-resourced languages. In this study, we investigate the impact of using occlusion-based techniques when training a language model for a text generation task. We curate two new datasets: the Sepedi monolingual (SepMono) dataset, drawn from several South African resources, and the Sepedi radio news (SepNews) dataset, drawn from the radio news domain. We use the SepMono dataset to pre-train transformer-based models using the occlusion and non-occlusion pre-training techniques and compare their performance. The SepNews dataset is used specifically for fine-tuning. Our results show that the non-occlusion models perform better compared…
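The abstract does not spell out how occlusion is implemented. As a rough illustration only, the sketch below assumes that "occlusion" means hiding a random subset of input tokens behind a mask symbol and asking the model to recover them, while "non-occlusion" means plain next-token prediction on unmodified text. The mask symbol, the 15% rate, the function names, and the placeholder tokens are all assumptions and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, not taken from the paper


def occlude(tokens, rate=0.15, seed=0):
    """Assumed occlusion objective: hide roughly `rate` of the tokens.

    Returns (inputs, targets). targets[i] holds the original token
    where inputs[i] was masked, and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < rate:
            inputs.append(MASK)
            targets.append(tok)
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets


def next_token_pairs(tokens):
    """Assumed non-occlusion objective: predict each token from its prefix.

    Returns (inputs, targets), where targets is inputs shifted by one.
    """
    return tokens[:-1], tokens[1:]


# Placeholder tokens standing in for a tokenized sentence.
sentence = ["w1", "w2", "w3", "w4", "w5", "w6"]

masked_inputs, masked_targets = occlude(sentence, rate=0.5, seed=1)
causal_inputs, causal_targets = next_token_pairs(sentence)

print(masked_inputs, masked_targets)
print(causal_inputs, causal_targets)
```

Under these assumptions, the occlusion objective computes its loss only on the masked positions, whereas next-token prediction computes it on every position, which is one reason loss and perplexity values from the two setups may not be directly comparable.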
Taxonomy
Topics: Neural Networks and Applications
