BENYO-S2ST-Corpus-1: A Bilingual English-to-Yoruba Direct Speech-to-Speech Translation Corpus

Emmanuel Adetiba; Abdultaofeek Abayomi; Raymond J. Kala; Ayodele H. Ifijeh; Oluwatobi E. Dare; Olabode Idowu-Bismark; Gabriel O. Sobola; Joy N. Adetiba; Monsurat Adepeju Lateef

arXiv:2507.09342 · cs.SD · January 8, 2026




TL;DR

This paper introduces BENYO-S2ST-Corpus-1, a large-scale bilingual speech dataset for English-to-Yoruba speech-to-speech translation (S2ST). It was created with a hybrid approach that combines real recordings with AI-generated and augmented audio, and is intended to support better S2ST models for low-resource languages.

Contribution

The study presents a novel hybrid architecture for large-scale direct S2ST corpus creation, combining real and AI-generated data with augmentation techniques, specifically for English-Yoruba translation.

Findings

01 Created a bilingual corpus of 24,064 samples totalling 41.20 hours of audio

02 Built a pretrained Yoruba TTS model with moderate pitch accuracy

03 Made the resources publicly available for low-resource language translation
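The headline figures can be sanity-checked with simple arithmetic. The sketch below takes the 24,064 samples and 41.20 hours from the findings and the 1,504 YORULECT seed samples from the abstract. The class and function names are hypothetical, and reading the ratio as a per-seed expansion factor is an assumption: this page does not say how the 24,064 samples are counted (per language, per pair, or per augmented variant).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusStats:
    """Headline size figures for a speech corpus (hypothetical helper)."""
    samples: int
    hours: float

    def mean_seconds(self) -> float:
        # Average clip length in seconds: total duration / number of samples.
        return self.hours * 3600 / self.samples


def expansion_factor(final_samples: int, seed_samples: int) -> float:
    # How many corpus samples exist per seed sample.
    return final_samples / seed_samples


# Figures quoted on this page.
benyo = CorpusStats(samples=24_064, hours=41.20)
SEED_SAMPLES = 1_504  # size of the YORULECT seed set, per the abstract

print(f"mean clip length: {benyo.mean_seconds():.2f} s")
print(f"samples per seed sample: {expansion_factor(benyo.samples, SEED_SAMPLES):.1f}")
```

Under these assumptions the average clip is a little over six seconds long, and the corpus holds exactly sixteen samples for every seed sample.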

Abstract

There is a major shortage of Speech-to-Speech Translation (S2ST) datasets for high-resource-to-low-resource language pairs such as English-to-Yoruba. Thus, in this study, we curated the Bilingual English-to-Yoruba Speech-to-Speech Translation Corpus Version 1 (BENYO-S2ST-Corpus-1). The corpus is based on a hybrid architecture we developed for large-scale direct S2ST corpus creation at reduced cost. To achieve this, we leveraged the non-speech-to-speech Standard Yoruba (SY) real-time audio and transcripts in the YORULECT Corpus, as well as the corresponding Standard English (SE) transcripts. The YORULECT Corpus is small-scale (1,504 samples) and does not have paired English audio. Therefore, we generated the SE audio using pre-trained AI models (i.e., Facebook MMS). We also developed an audio augmentation algorithm named AcoustAug based on three latent acoustic features to generate augmented…


Code & Models

Datasets

aspmirlab/BENYO-S2ST-Corpus-1
dataset · 6 dl


Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Speech and dialogue systems