Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov

TL;DR
This paper introduces a new high-quality Yorùbá dialect corpus, runs NLP experiments that reveal performance disparities across dialects, and shows that dialect-adaptive finetuning can improve performance, supporting future NLP development for African languages.
Contribution
The paper presents YORÙLECT, a parallel text and speech corpus covering three domains and four regional Yorùbá dialects, and provides experimental insights into dialectal disparities and adaptation techniques.
Findings
Significant performance gaps between standard and dialectal Yorùbá in NLP tasks.
Dialect-adaptive finetuning reduces performance disparities.
The dataset and models are publicly released for further research.
Abstract
Yorùbá, an African language with roughly 47 million speakers, encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are few to no resources or tools. We take steps towards bridging this gap by introducing YORÙLECT, a new high-quality parallel text and speech corpus across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other…
Taxonomy
Topics: Natural Language Processing Techniques
