# Advancing automatic speech recognition for low-resource Ghanaian languages: Audio datasets for Akan, Ewe, Dagbani, Dagaare, and Ikposo

**Authors:** Isaac Wiafe, Jamal-Deen Abdulai, Akon Obu Ekpezu, Raynard Dodzi Helegah, Elikem Doe Atsakpo, Charles Nutrokpor, Fiifi Baffoe Payin Winful, Kafui Kwashie Solaga

PMC · DOI: 10.1016/j.dib.2025.111880 · Data in Brief · 2025-07-11

## TL;DR

This paper introduces a 5000-hour audio dataset for five low-resource Ghanaian languages to improve speech recognition and language preservation.

## Contribution

The paper presents the first large-scale, validated audio datasets for Akan, Ewe, Dagbani, Dagaare, and Ikposo.

## Key findings

- A 1000-hour validated audio corpus was created for each of the five low-resource languages.
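- A randomly selected 10% of the recordings in each language was transcribed, yielding approximately 100 hours of transcribed audio per language to support ASR and linguistic research.
- Data collection followed ethical guidelines, and participating indigenous speakers received incentives.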

## Abstract

Audio datasets are fundamental to the development of automatic speech recognition (ASR) systems. However, large audio corpora for low-resource languages (LRLs) remain scarce. This study addresses this gap by introducing audio speech datasets for five low-resource languages spoken in Ghana and parts of Togo. Specifically, it presents a 5000-hour speech corpus in Akan, Ewe, Dagbani, Dagaare, and Ikposo. Each language corpus includes 1000 h of validated audio speech recorded by indigenous speakers of that language. The recordings are spoken descriptions of 1000 culturally relevant images, collected using a custom Android mobile application. To enhance the dataset’s utility in ASR and linguistic research, 10% of the audio recordings for each language were randomly selected and transcribed, resulting in approximately 100 h of transcription per language. This dataset represents a critical resource for preserving and documenting Ghanaian languages, and it holds the potential for advancing speech and language technologies in these languages. Creating this audio dataset is the first step towards bridging the technological gap between high- and low-resource languages. Ethical guidelines were strictly followed throughout the data collection process, and participants were given incentives for lending their voices to this study.
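
The random 10% selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the recording IDs, the per-language count of 1000 items, the `select_for_transcription` helper, and the fixed seed are all hypothetical, chosen only to show per-language uniform sampling without replacement.

```python
import random

LANGUAGES = ["Akan", "Ewe", "Dagbani", "Dagaare", "Ikposo"]


def select_for_transcription(recording_ids, fraction=0.10, seed=0):
    """Pick a uniform random subset of recordings for transcription.

    Hypothetical helper: the paper states only that 10% of recordings
    per language were randomly selected, not how this was implemented.
    """
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    k = round(len(recording_ids) * fraction)
    return sorted(rng.sample(recording_ids, k))  # sampling without replacement


# Hypothetical recording IDs; the real corpus layout is not given in this summary.
corpus = {
    lang: [f"{lang.lower()}_{i:05d}" for i in range(1000)]
    for lang in LANGUAGES
}

# Sample each language independently so every language keeps its 10% share.
selected = {lang: select_for_transcription(ids) for lang, ids in corpus.items()}

for lang in LANGUAGES:
    print(lang, len(selected[lang]), "of", len(corpus[lang]), "recordings selected")
```

Sampling 10% of recordings gives roughly 10% of the audio hours only when recording durations are similar, which is consistent with the abstract's "approximately 100 h" wording.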

## Full-text entities

- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12301755/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12301755/full.md

## References

6 references — full list in the complete paper: https://tomesphere.com/paper/PMC12301755/full.md

---
Source: https://tomesphere.com/paper/PMC12301755