Swivuriso: The South African Next Voices Multilingual Speech Dataset

Vukosi Marivate; Kayode Olaleye; Sitwala Mundia; Andinda Bakainga; Unarine Netshifhefhe; Mahmooda Milanzie; Tsholofelo Hope Mogale; Thapelo Sindane; Zainab Abdulrasaq; Kesego Mokgosi; Chijioke Okorie; Nia Zion Van Wyk; Graham Morrissey; Dale Dunbar; Francois Smit; Tsosheletso Chidi; Rooweither Mabuya; Andiswa Bukula; Respect Mlambo; Tebogo Macucwa; Idris Abdulmumin; and Seani Rananga

arXiv:2512.02201·cs.CL·January 21, 2026

TL;DR

Swivuriso is a 3000-hour multilingual speech dataset covering seven South African languages, built to support ASR development and benchmarking in underrepresented languages and domains.

Contribution

The paper introduces Swivuriso, a large-scale, ethically collected speech dataset for South African languages, filling a critical gap in resources for multilingual ASR development.

Findings

1. Baseline ASR models trained or fine-tuned on Swivuriso show promising performance.

2. Swivuriso offers broader domain coverage and language diversity than existing datasets.

3. The dataset enables improved benchmarking of ASR systems for South African languages.

Abstract

This paper introduces Swivuriso, a 3000-hour multilingual speech dataset developed as part of the African Next Voices project to support the development and benchmarking of automatic speech recognition (ASR) technologies in seven South African languages. Covering agriculture, healthcare, and general-domain topics, Swivuriso addresses significant gaps in existing ASR datasets. We describe the design principles, ethical considerations, and data collection procedures that guided the dataset's creation. We present baseline results from training and fine-tuning ASR models on this data and compare them against other ASR datasets for the languages concerned.

Taxonomy

Topics: Speech Recognition and Synthesis · ICT in Developing Communities · AI in Service Interactions