Sagalee: an Open Source Automatic Speech Recognition Dataset for Oromo Language

Turi Abu; Ying Shi; Thomas Fang Zheng; Dong Wang

arXiv:2502.00421·cs.CL·February 4, 2025

TL;DR

This paper introduces Sagalee, an open-source, crowdsourced speech dataset with 100 hours of read Oromo speech, and reports baseline ASR results for this underrepresented language using Conformer and fine-tuned Whisper models.

Contribution

The creation of the first large-scale, publicly available Oromo ASR dataset collected via crowdsourcing, with baseline ASR performance benchmarks.

Findings

01

Achieved a WER of 15.32% with a Conformer model trained with hybrid CTC and AED loss (18.74% with pure CTC loss)

02

Fine-tuning Whisper reduced WER to 10.82%

03

Dataset covers diverse speakers and noisy environments
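
The hybrid loss in finding 01 combines a CTC objective with an attention-based encoder-decoder (AED) objective. A common way to write such a combination is a weighted interpolation, sketched below. The weight \lambda and this exact form are assumptions for illustration; this summary does not state the value or formulation the authors used.

```latex
% Assumed form: interpolation of the two losses with weight \lambda in [0, 1].
% \lambda = 1 recovers the pure CTC loss.
\mathcal{L}_{\text{hybrid}}
  = \lambda \, \mathcal{L}_{\text{CTC}}
  + (1 - \lambda) \, \mathcal{L}_{\text{AED}},
\qquad 0 \le \lambda \le 1
```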

Abstract

We present a novel Automatic Speech Recognition (ASR) dataset for the Oromo language, a widely spoken language in Ethiopia and neighboring regions. The dataset was collected through a crowd-sourcing initiative, encompassing a diverse range of speakers and phonetic variations. It consists of 100 hours of real-world audio recordings paired with transcriptions, covering read speech in both clean and noisy environments. This dataset addresses the critical need for ASR resources for the Oromo language, which is underrepresented. To show its applicability for the ASR task, we conducted experiments using the Conformer model, achieving a Word Error Rate (WER) of 15.32% with hybrid CTC and AED loss and a WER of 18.74% with pure CTC loss. Additionally, fine-tuning the Whisper model resulted in a significantly improved WER of 10.82%. These results establish baselines for Oromo ASR, highlighting both…
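
The WER figures above are word-level edit distances normalized by reference length. As a minimal sketch of that standard definition (not the authors' evaluation script), the function below counts substitutions, deletions, and insertions between whitespace-split word lists. The function name `word_error_rate` and the placeholder strings are hypothetical, and real evaluations typically also normalize case and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens, not dataset content: one substitution and one deletion
# against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

Because insertions count as errors, WER computed this way can exceed 100% when the hypothesis is much longer than the reference.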

Code & Models

Repositories

turinaf/sagalee (PyTorch, official)

Taxonomy

Topics: Language, Linguistics, Cultural Analysis · Natural Language Processing Techniques · African history and culture analysis