The Thiomi Dataset: A Large-Scale Multimodal Corpus for Low-Resource African Languages

Hillary Mutisya; John Mugane; Gavin Nyamboga; Brian Chege; Maryruth Gathoni

arXiv:2603.29244 · cs.CL · April 7, 2026

TL;DR

The paper introduces the Thiomi Dataset, a large-scale multimodal corpus of text annotations and audio recordings for ten African languages, and validates it with ASR, MT, and TTS baselines, including 3.24% WER on Swahili ASR.

Contribution

The paper provides a large-scale dataset covering ten low-resource African languages, together with baseline results for ASR, MT, and TTS.

Findings

01

Achieved 3.24% WER on Swahili ASR (Common Voice), down from the prior academic SOTA of 8.3%.

02

Collected over 601,000 approved sentence-level text annotations and over 385,000 audio recordings from over 100 contributors.

03

Demonstrated the dataset's utility by training and evaluating baseline ASR, MT, and TTS models across all ten languages.

Abstract

We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings, collected through a dedicated community data collection platform involving over 100 contributors. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), down from the prior academic SOTA of 8.3% (a 5.1 percentage point absolute and 61% relative reduction), and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows,…
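The WER figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: `word_error_rate` and `wer_reduction` are hypothetical helper names, the WER definition is the standard word-level edit distance divided by reference length, and the example sentences are toy placeholders rather than dataset samples.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,                      # deletion
                    curr[j - 1] + 1,                  # insertion
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def wer_reduction(prior_wer: float, new_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in percent)."""
    absolute = prior_wer - new_wer
    relative = 100.0 * absolute / prior_wer
    return absolute, relative


# Toy example: one substituted word out of four reference words.
toy_wer = word_error_rate("the cat sat down", "the cat sat up")
print(f"toy WER: {toy_wer:.1f}%")

# Figures from the abstract: prior academic SOTA 8.3%, new result 3.24%.
absolute, relative = wer_reduction(8.3, 3.24)
print(f"{absolute:.1f} points absolute, {relative:.0f}% relative")
```

The absolute reduction is a difference of percentages, so it is reported in percentage points, while the relative reduction divides that difference by the prior WER.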
