LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models

Ahmed Khaled Khamis; Hesham Ali

arXiv:2602.15675 · cs.CL · March 30, 2026

TL;DR

This paper introduces NileTTS, a new Egyptian Arabic speech dataset created via a novel synthetic pipeline using large language models, and demonstrates its effectiveness in training dialect-specific TTS models.

Contribution

The paper presents the first Egyptian Arabic TTS dataset, a reproducible synthetic data pipeline, and an open-source fine-tuned TTS model for dialectal speech synthesis.

Findings

01. NileTTS contains 38 hours of transcribed speech from two speakers.
02. Fine-tuning XTTS v2 on NileTTS improves dialectal TTS performance.
03. Resources are publicly released for research use.

Abstract

Despite advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic -- the most widely understood Arabic dialect -- severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic…
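
The stages named in the abstract (LLM text generation, speech synthesis, automatic transcription with speaker diarization, manual quality verification) can be pictured as a chain of pluggable steps. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: `Segment`, `build_dataset`, `hours_per_speaker`, and the `toy_*` stand-ins are hypothetical names, and the stand-ins only fake each stage so the control flow can be run end to end.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Segment:
    """One diarized, transcribed span of audio (hypothetical schema)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str

    @property
    def duration(self) -> float:
        return self.end - self.start


def build_dataset(
    topics: Iterable[str],
    generate_text: Callable[[str], str],
    synthesize: Callable[[str], bytes],
    transcribe_and_diarize: Callable[[bytes], List[Segment]],
    verify: Callable[[Segment], bool],
) -> List[Segment]:
    """Run each topic through the four stages and keep verified segments."""
    kept: List[Segment] = []
    for topic in topics:
        script = generate_text(topic)          # stage 1: LLM writes content
        audio = synthesize(script)             # stage 2: text becomes audio
        for seg in transcribe_and_diarize(audio):  # stage 3: ASR + diarization
            if verify(seg):                    # stage 4: quality check
                kept.append(seg)
    return kept


def hours_per_speaker(segments: Iterable[Segment]) -> Dict[str, float]:
    """Sum segment durations per speaker, in hours."""
    totals: Dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + seg.duration / 3600
    return totals


# Toy stand-ins (assumptions for illustration only; no real model is called).
def toy_generate(topic: str) -> str:
    return f"script about {topic}"


def toy_synthesize(script: str) -> bytes:
    return script.encode("utf-8")  # pretend the bytes are audio


def toy_transcribe_and_diarize(audio: bytes) -> List[Segment]:
    text = audio.decode("utf-8")
    return [
        Segment("speaker_a", 0.0, 5.0, text),
        Segment("speaker_b", 5.0, 5.2, ""),  # empty, very short: should be dropped
    ]


def toy_verify(seg: Segment) -> bool:
    # Stand-in for manual review: require non-empty text and at least 1 second.
    return bool(seg.text.strip()) and seg.duration >= 1.0


if __name__ == "__main__":
    data = build_dataset(
        ["medical", "sales"],
        toy_generate,
        toy_synthesize,
        toy_transcribe_and_diarize,
        toy_verify,
    )
    print(len(data), hours_per_speaker(data))
```

Passing the stages in as callables keeps the sketch independent of any particular LLM, synthesis tool, or transcription system, none of which are named in the text above.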

Code & Models

Models

KickItLikeShika/NileTTS-XTTS
model · 103 downloads · 3 likes

Datasets

KickItLikeShika/NileTTS-dataset
dataset · 599 downloads

Videos

LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models