Pula: Training Large Language Models for Setswana
Nathan Brown, Vukosi Marivate

TL;DR
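Pula is a suite of bilingual Setswana-English language models. Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation and reach state-of-the-art performance for their size on Setswana reasoning tasks. The models are released with new datasets, benchmarks, and open-source resources for Setswana NLP research.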
Contribution
The paper presents the first large Setswana language models (Pula 1B, 3B, 8B, and 14B), along with new datasets, benchmarks, and open-source code, advancing Setswana NLP capabilities and research infrastructure.
Findings
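Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks.
Pula models achieve state-of-the-art performance for their size on Setswana reasoning tasks.
The release includes Marothodi, the largest Setswana text corpus to date, and Medupi, the first comprehensive Setswana instruction-tuning dataset.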
Abstract
In this work we present Pula, a suite of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks and achieve state-of-the-art performance on Setswana reasoning tasks for their size. We release the weights for Pula 1B, 3B, 8B, and 14B as well as training logs and training and evaluation code. Alongside Pula, we release the largest-ever Setswana text corpus, Marothodi, and the first comprehensive Setswana instruction-tuning dataset, Medupi, consisting of reformatted datasets, translated corpora, and synthetic LLM-generated text. To accompany this data, we release the code used for dataset construction, formatting, filtering, and scraping. Last, we release two Setswana LLM-translated benchmarks, MMLU-tsn…
Taxonomy
Topics: Natural Language Processing Techniques
