Pula: Training Large Language Models for Setswana

Nathan Brown; Vukosi Marivate

arXiv:2408.02239·cs.CL·April 29, 2025

TL;DR

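Pula is a suite of bilingual Setswana-English language models. Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation and reach state-of-the-art performance for their size on Setswana reasoning tasks. The release includes new datasets, benchmarks, and open-source resources for Setswana NLP research.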

Contribution

The paper presents the first large Setswana language models, new datasets, benchmarks, and open-source code, advancing Setswana NLP capabilities and research infrastructure.

Findings

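01. Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks.

02. The models achieve state-of-the-art performance for their size on Setswana reasoning tasks.

03. The authors release Marothodi, the largest Setswana text corpus to date, and Medupi, the first comprehensive Setswana instruction-tuning dataset.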

Abstract

In this work we present Pula, a suite of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, Pula 8B and Pula 14B outperform GPT-4o and Gemini 1.5 Pro on English-Setswana translation tasks and achieve state-of-the-art performance on Setswana reasoning tasks for their size. We release the weights for Pula 1B, 3B, 8B, and 14B as well as training logs and training and evaluation code. Alongside Pula, we release the largest-ever Setswana text corpus, Marothodi, and the first comprehensive Setswana instruction-tuning dataset, Medupi, consisting of reformatted datasets, translated corpora, and synthetic LLM-generated text. To accompany this data, we release the code used for dataset construction, formatting, filtering, and scraping. Last, we release two Setswana LLM-translated benchmarks, MMLU-tsn…

Taxonomy

Topics: Natural Language Processing Techniques