MzansiText and MzansiLM: An Open Corpus and Decoder-Only Language Model for South African Languages
Anri Lombard, Simbarashe Mawere, Temi Aina, Ethan Wolff, Sbonelo Gumede, Elan Novick, Francois Meyer, Jan Buys

TL;DR
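This paper introduces MzansiText, a curated multilingual pretraining corpus, and MzansiLM, a 125M-parameter decoder-only language model trained from scratch for South African languages. Task-specific finetuning adapts the model effectively to NLU and NLG tasks despite low-resource constraints, though few-shot reasoning remains challenging at this scale.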
Contribution
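The paper presents a curated pretraining corpus and what the authors believe to be the first publicly available decoder-only model that explicitly targets all eleven official written languages of South Africa, along with a detailed evaluation of adaptation strategies at small scale.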
Findings
Monolingual task-specific finetuning achieves strong performance on data-to-text generation for isiXhosa.
Multilingual task-specific finetuning is effective for topic classification.
Few-shot reasoning remains challenging at this model size.
Abstract
Decoder-only language models can be adapted to diverse tasks through instruction finetuning, but the extent to which this generalizes at small scale for low-resource languages remains unclear. We focus on the languages of South Africa, where we are not aware of a publicly available decoder-only model that explicitly targets all eleven official written languages, nine of which are low-resource. We introduce MzansiText, a curated multilingual pretraining corpus with a reproducible filtering pipeline, and MzansiLM, a 125M-parameter language model trained from scratch. We evaluate MzansiLM on natural language understanding and generation using three adaptation regimes: monolingual task-specific finetuning, multilingual task-specific finetuning, and general multi-task instruction finetuning. Monolingual task-specific finetuning achieves strong performance on data-to-text generation, reaching…
Taxonomy
Topics: Natural Language Processing Techniques · Computational and Text Analysis Methods · Text Readability and Simplification
