Towards Zero-Shot Text-To-Speech for Arabic Dialects
Khai Duy Doan, Abdul Waheed, Muhammad Abdul-Mageed

TL;DR
This paper advances zero-shot multi-dialect Arabic text-to-speech by adapting datasets, leveraging dialect identification, and fine-tuning open-source models, achieving promising results for unseen speakers and dialectal speech synthesis.
Contribution
It introduces a novel approach combining dataset adaptation, dialect identification, and fine-tuning of open-source models for Arabic ZS-TTS, addressing resource scarcity.
Findings
Convincing automated and human evaluation results
Effective generation of dialectal speech
Significant potential for Arabic ZS-TTS improvements
Abstract
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; for other languages, however, progress still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS model, an open-source architecture (see https://docs.coqui.ai/en/latest/models/xtts.html, https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f, and https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc). We then evaluate our models on a dataset comprising 31 unseen speakers and an…
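The abstract mentions using a set of dialect identification models to attach pre-defined dialect labels to the training data, but this summary does not say how their outputs are combined. The sketch below shows one possible scheme, a simple majority vote with ties mapped to a fallback label. The function names, the record fields, and the dialect codes ("EGY", "LEV", "GLF") are hypothetical and chosen only for illustration; this is not the authors' method.

```python
from collections import Counter


def majority_dialect(predictions, fallback="unknown"):
    """Pick the label most models agree on; return `fallback` on a tie or empty input.

    `predictions` is a list of label strings, one per dialect-ID model.
    Majority voting is an assumption here, not something the paper states.
    """
    top_two = Counter(predictions).most_common(2)
    if not top_two:
        return fallback
    # If the two most frequent labels have the same count, no label wins.
    if len(top_two) == 2 and top_two[0][1] == top_two[1][1]:
        return fallback
    return top_two[0][0]


def attach_dialect_labels(records, per_model_predictions):
    """Return copies of `records` with an added 'dialect' field.

    `records` is a list of dicts (hypothetical fields: 'audio', 'text').
    `per_model_predictions[i]` holds every model's label for records[i].
    """
    if len(records) != len(per_model_predictions):
        raise ValueError("one prediction list is required per record")
    labeled = []
    for record, predictions in zip(records, per_model_predictions):
        new_record = dict(record)  # leave the input untouched
        new_record["dialect"] = majority_dialect(predictions)
        labeled.append(new_record)
    return labeled


if __name__ == "__main__":
    records = [
        {"audio": "clip_001.wav", "text": "..."},
        {"audio": "clip_002.wav", "text": "..."},
    ]
    predictions = [["EGY", "EGY", "LEV"], ["GLF", "LEV", "EGY"]]
    for row in attach_dialect_labels(records, predictions):
        print(row["audio"], row["dialect"])
```

Mapping ties to a fallback label keeps uncertain utterances from being trained under a possibly wrong dialect tag; a stricter agreement threshold or confidence-weighted voting would be equally plausible alternatives.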
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Sparse Evolutionary Training
