# A curated crowdsourced dataset of Luganda and Swahili speech for text-to-speech synthesis

**Authors:** Andrew Katumba, Sulaiman Kagumire, Joyce Nakatumba-Nabende, John Quinn, Sudi Murindanyi

PMC · DOI: 10.1016/j.dib.2025.111915 · Data in Brief · 2025-07-23

## TL;DR

This paper introduces a curated Luganda and Kiswahili speech dataset, derived from Mozilla's Common Voice corpus, to support text-to-speech development for low-resource languages.

## Contribution

A new crowdsourced speech dataset for Luganda and Kiswahili with rigorous curation and preprocessing for TTS research.

## Key findings

- The dataset includes over 19 hours of Luganda and 15 hours of Kiswahili speech from six female speakers per language.
- A multi-step curation process improved data consistency and quality using acoustic clustering and speech quality scoring.
- The dataset supports reproducible TTS research and speech generation in under-resourced African languages.
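The speech quality scoring step can be illustrated with a minimal sketch. It assumes each clip already carries a predicted MOS from some quality predictor and applies the 3.5 retention cutoff stated in the abstract. The `Clip` record, the language labels, and the example values are hypothetical and are not taken from the dataset or from the authors' code.

```python
from dataclasses import dataclass

MOS_THRESHOLD = 3.5  # retention cutoff stated in the abstract


@dataclass
class Clip:
    path: str             # hypothetical file name
    language: str         # hypothetical label, e.g. "lg" or "sw"
    duration_s: float     # clip length in seconds
    predicted_mos: float  # assumed to be precomputed by a quality predictor


def filter_by_mos(clips, threshold=MOS_THRESHOLD):
    """Keep clips whose predicted MOS is at or above the threshold."""
    return [c for c in clips if c.predicted_mos >= threshold]


def hours_by_language(clips):
    """Sum clip durations per language and convert seconds to hours."""
    totals = {}
    for c in clips:
        totals[c.language] = totals.get(c.language, 0.0) + c.duration_s / 3600.0
    return totals


# Made-up example records, used only to exercise the functions.
clips = [
    Clip("a.wav", "lg", 5.0, 4.1),
    Clip("b.wav", "lg", 4.0, 3.5),
    Clip("c.wav", "sw", 6.0, 3.2),
    Clip("d.wav", "sw", 3.0, 3.9),
]
kept = filter_by_mos(clips)
totals = hours_by_language(kept)
```

The comparison is inclusive (`>=`) because the abstract describes retaining clips scoring "3.5 or higher".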

## Abstract

This data article describes a curated, crowdsourced speech dataset in Luganda and Kiswahili, created to support text-to-speech (TTS) development in low-resource settings. The dataset is derived from Mozilla’s Common Voice corpus and includes only validated utterances from female speakers. A multi-step curation process was used to enhance the consistency and quality of the data. Speakers were first manually selected based on similarities in intonation, pitch, and rhythm, then validated using acoustic clustering with pitch features and mel-frequency cepstral coefficients (MFCCs). Audio files were preprocessed to remove leading and trailing silences using WebRTC voice activity detection, denoised with a causal waveform-based DEMUCS model, and filtered using WV-MOS, an automatic speech quality scoring tool. Only clips with a predicted MOS score of 3.5 or higher were retained. The final dataset contains over 19 h of Luganda and 15 h of Kiswahili recordings from six female speakers per language, each paired with a text transcription. This dataset is designed to support speech generation research in Luganda and Kiswahili and enable reproducible experimentation in end-to-end TTS systems.
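To make the silence-removal step concrete, the following sketch trims leading and trailing silence from a list of samples. It uses a simple frame-energy threshold as a stand-in; the article uses WebRTC voice activity detection, which this code does not reproduce. The function names, frame length, and threshold are illustrative assumptions.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete, non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def trim_silence(samples, frame_len=4, threshold=0.01):
    """Drop leading and trailing frames whose energy is below the threshold.

    Energy thresholding is an assumed simplification, not WebRTC VAD.
    Silence between voiced frames is kept; only the edges are trimmed.
    """
    energies = frame_energies(samples, frame_len)
    voiced = [i for i, e in enumerate(energies) if e >= threshold]
    if not voiced:
        return []  # nothing above the threshold: treat the clip as silent
    start = voiced[0] * frame_len
    end = (voiced[-1] + 1) * frame_len
    return samples[start:end]


# Synthetic signal: silence, a burst, a short pause, a second burst, silence.
signal = (
    [0.0] * 8
    + [0.5, -0.5, 0.4, -0.4]
    + [0.0] * 4
    + [0.3, -0.3, 0.3, -0.3]
    + [0.0] * 8
)
trimmed = trim_silence(signal)
```

Keeping the internal pause matches the description of removing only leading and trailing silences, so the rhythm inside an utterance is left untouched.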

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12337013/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/PMC12337013/full.md

## References

7 references — full list in the complete paper: https://tomesphere.com/paper/PMC12337013/full.md

---
Source: https://tomesphere.com/paper/PMC12337013