NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages

Marie Maltais; Yejin Jeon; Min Ma; Shamsuddeen Hassan Muhammad; Idris Abdulmumin; Maryam Ibrahim Mukhtar; Daud Abolade; Joel Okepefi; Johnson Sewedo; David Ifeoluwa Adelani

arXiv:2604.16287·cs.SD·April 20, 2026

TL;DR

NaijaS2ST introduces a parallel speech translation dataset pairing Igbo, Hausa, Yorùbá, and Nigerian Pidgin with English, and benchmarks cascaded, end-to-end, and audio LLM approaches on it. Audio LLMs perform best on speech-to-text translation, while speech-to-speech translation remains challenging.

Contribution

The paper presents a new multilingual Nigerian speech translation dataset and provides a systematic benchmark of different translation methods in low-resource settings.

Findings

01 Audio LLMs outperform cascaded and end-to-end models in speech-to-text translation.

02 Cascaded and audio LLM approaches perform similarly in speech-to-speech translation.

03 NaijaS2ST offers a valuable resource for advancing low-resource multilingual speech translation.

Abstract

Speech translation for low-resource languages remains fundamentally limited by the scarcity of high-quality, diverse parallel speech data, a challenge that is especially pronounced in African linguistic contexts. To address this, we introduce NaijaS2ST, a parallel speech translation dataset spanning Igbo, Hausa, Yorùbá, and Nigerian Pidgin paired with English. The dataset comprises approximately 50 hours of speech per language and captures substantial variation in speakers and accents, reflecting realistic multilingual and multi-accent conditions. With NaijaS2ST, we conduct a comprehensive benchmark of cascaded, end-to-end (E2E), and AudioLLM-based approaches across bidirectional translation settings. Our results show that audio LLMs with few-shot examples are more effective for speech-to-text translation than cascaded and end-to-end methods trained on fine-tuned data. However, for…

Code & Models

Datasets

McGill-NLP/african_celtic_dataset
dataset · 443 downloads
