Mafoko: Structuring and Building Open Multilingual Terminologies for South African NLP
Vukosi Marivate, Isheanesu Dzingirai, Fiskani Banda, Richard Lastrucci, Thapelo Sindane, Keabetswe Madumo, Kayode Olaleye, Abiodun Modupe, Unarine Netshifhefhe, Herkulaas Combrink, Mohlatlego Nakeng, Matome Ledwaba

TL;DR
Mafoko consolidates fragmented terminology lists for South Africa's official languages into an open, standardized dataset. Integrated into a RAG pipeline, it improves the accuracy and domain-specific consistency of English-to-Tshivenda machine translation.
Contribution
It systematically aggregates and standardizes fragmented terminological data into an open dataset, enabling improved NLP applications for South African languages.
Findings
Enhanced machine translation accuracy for English-to-Tshivenda.
Demonstrated utility of Mafoko in a RAG pipeline.
Provided a scalable, open resource for South African NLP.
Abstract
The critical lack of structured terminological data for South Africa's official languages hampers progress in multilingual NLP, despite the existence of numerous government and academic terminology lists. These valuable assets remain fragmented and locked in non-machine-readable formats, rendering them unusable for computational research and development. Mafoko addresses this challenge by systematically aggregating, cleaning, and standardising these scattered resources into open, interoperable datasets. We introduce the foundational Mafoko dataset, released under the equitable, Africa-centered NOODL framework. To demonstrate its immediate utility, we integrate the terminology into a Retrieval-Augmented Generation (RAG) pipeline. Experiments show substantial improvements in the accuracy and domain-specific consistency of English-to-Tshivenda machine translation for large language models.…
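The abstract does not say how the RAG pipeline retrieves or injects terminology, so the following is only a minimal sketch of one plausible approach: look up glossary entries whose English term occurs in the source sentence, then prepend the matching pairs to a translation prompt. The `TermEntry` schema, the `retrieve_terms` and `build_prompt` helpers, and the glossary rows are all hypothetical; the Tshivenda side is left as placeholder strings rather than real Mafoko entries, and no model call is shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TermEntry:
    """One bilingual terminology record (hypothetical schema, not Mafoko's)."""
    english: str
    tshivenda: str
    domain: str


# Placeholder rows: the target strings are stand-ins, not real translations.
GLOSSARY = [
    TermEntry("budget", "<ve:budget>", "finance"),
    TermEntry("interest rate", "<ve:interest rate>", "finance"),
    TermEntry("vaccine", "<ve:vaccine>", "health"),
]


def retrieve_terms(sentence, glossary):
    """Return entries whose English term appears as a whole word or phrase."""
    hits = []
    for entry in glossary:
        pattern = r"\b" + re.escape(entry.english) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            hits.append(entry)
    return hits


def build_prompt(sentence, hits):
    """Assemble a translation prompt that lists the retrieved term pairs."""
    lines = ["Translate the following English sentence into Tshivenda."]
    if hits:
        lines.append("Use these terminology pairs where they apply:")
        for entry in hits:
            lines.append(f"- {entry.english} -> {entry.tshivenda} ({entry.domain})")
    lines.append(f"English: {sentence}")
    lines.append("Tshivenda:")
    return "\n".join(lines)


if __name__ == "__main__":
    source = "The interest rate affects the national budget."
    print(build_prompt(source, retrieve_terms(source, GLOSSARY)))
```

Exact whole-word matching is the simplest retrieval choice and misses inflected forms such as "budgets"; a real pipeline might instead use lemmatisation or embedding-based retrieval over the terminology entries.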
Taxonomy
Topics: Linguistics and Terminology Studies · Lexicography and Language Studies · Translation Studies and Practices
