LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khaled Khamis, Hesham Ali

TL;DR
This paper introduces NileTTS, a new Egyptian Arabic speech dataset created via a novel synthetic pipeline using large language models, and demonstrates its effectiveness in training dialect-specific TTS models.
Contribution
The paper presents the first Egyptian Arabic TTS dataset, a reproducible synthetic data pipeline, and an open-source fine-tuned TTS model for dialectal speech synthesis.
Findings
NileTTS contains 38 hours of transcribed speech from two speakers.
Fine-tuning XTTS v2 on NileTTS improves dialectal TTS performance.
Resources are publicly released for research use.
Abstract
Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely understood Arabic dialect -- severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic…
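The stage ordering described in the abstract can be sketched roughly as follows. This is only an illustration, not the authors' code: the names `Segment`, `build_dataset`, and `hours_per_speaker` are hypothetical, and each stage (LLM generation, audio synthesis, transcription with diarization, manual verification) is passed in as a plain callable because the abstract does not name the specific tools used.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Segment:
    """One diarized, transcribed utterance (hypothetical record layout)."""
    speaker: str
    text: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_dataset(
    topics: Iterable[str],
    generate_text: Callable[[str], str],                       # assumed: LLM call
    synthesize: Callable[[str], Any],                          # assumed: audio synthesis tool
    transcribe_and_diarize: Callable[[Any], List[Segment]],    # assumed: ASR + diarization
    verify: Callable[[Segment], bool],                         # assumed: manual quality check
) -> List[Segment]:
    """Run the four stages in order and keep only verified segments."""
    kept: List[Segment] = []
    for topic in topics:
        script = generate_text(topic)
        audio = synthesize(script)
        for segment in transcribe_and_diarize(audio):
            if verify(segment):
                kept.append(segment)
    return kept


def hours_per_speaker(segments: Iterable[Segment]) -> Dict[str, float]:
    """Sum segment durations per speaker and convert seconds to hours."""
    totals: Dict[str, float] = defaultdict(float)
    for segment in segments:
        totals[segment.speaker] += segment.duration
    return {speaker: seconds / 3600.0 for speaker, seconds in totals.items()}
```

Injecting the stages as callables keeps the sketch independent of any particular LLM, synthesis, or transcription tool; a per-speaker tally such as `hours_per_speaker` is one way to check a corpus against a reported total like the 38 hours stated above.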
