Arabic TTS with FastPitch: Reproducible Baselines, Adversarial Training, and Oversmoothing Analysis
Lars Nippert

TL;DR
This paper develops reproducible FastPitch-based Arabic TTS baselines, introduces cepstral-domain metrics for analyzing oversmoothing, and uses a lightweight adversarial spectrogram loss to reduce oversmoothing and XTTSv2-generated synthetic voices to improve prosodic diversity without loss of stability.
Contribution
It presents the first reproducible FastPitch Arabic TTS baseline, introduces cepstral-domain oversmoothing metrics, and enhances multi-speaker TTS with adversarial loss and synthetic voice augmentation.
Findings
A lightweight adversarial spectrogram loss trains stably and substantially reduces oversmoothing.
Cepstral metrics reveal oversmoothing effects during training.
Synthetic voices improve prosodic diversity in multi-speaker TTS.
Abstract
Arabic text-to-speech (TTS) remains challenging due to limited resources and complex phonological patterns. We present reproducible baselines for Arabic TTS built on the FastPitch architecture and introduce cepstral-domain metrics for analyzing oversmoothing in mel-spectrogram prediction. While traditional Lp reconstruction losses yield smooth but over-averaged outputs, the proposed metrics reveal their temporal and spectral effects throughout training. To address this, we incorporate a lightweight adversarial spectrogram loss, which trains stably and substantially reduces oversmoothing. We further explore multi-speaker Arabic TTS by augmenting FastPitch with synthetic voices generated using XTTSv2, resulting in improved prosodic diversity without loss of stability. The code, pretrained models, and training recipes are publicly available at:…
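The abstract does not define the cepstral-domain metrics, so the following is only a minimal sketch of one plausible proxy, not the paper's method. It assumes that oversmoothing shows up as a loss of fine spectral detail, which in the cepstral domain means less energy at high quefrency indices. The names `dct2_matrix`, `high_quefrency_ratio`, and the `cutoff` value are hypothetical, and the spectrograms are synthetic stand-ins rather than model outputs.

```python
import numpy as np


def dct2_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II matrix of shape (n, n)."""
    k = np.arange(n)[:, None]  # output (quefrency) index
    m = np.arange(n)[None, :]  # input (mel band) index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (m + 0.5) * k / n)
    mat[0] /= np.sqrt(2.0)     # rescale the first row so all rows have unit norm
    return mat


def high_quefrency_ratio(log_mel: np.ndarray, cutoff: int = 20) -> float:
    """Share of cepstral energy at quefrency index >= cutoff, excluding c0.

    log_mel has shape (n_mels, n_frames). The cutoff is an illustrative
    assumption, not a value taken from the paper.
    """
    # DCT along the mel axis turns each frame into a cepstrum.
    cepstrum = dct2_matrix(log_mel.shape[0]) @ log_mel
    # Average energy per quefrency index over all frames.
    energy = np.mean(cepstrum ** 2, axis=1)
    # c0 (index 0) carries overall level, so it is left out of the total.
    return float(energy[cutoff:].sum() / energy[1:].sum())


# Synthetic stand-ins: a detailed "log-mel" array and a blurred copy of it.
rng = np.random.default_rng(0)
sharp = rng.standard_normal((80, 200))
kernel = np.ones(5) / 5.0  # 5-band moving average along the mel axis
smooth = np.apply_along_axis(
    lambda col: np.convolve(col, kernel, mode="same"), 0, sharp
)

sharp_ratio = high_quefrency_ratio(sharp)
smooth_ratio = high_quefrency_ratio(smooth)
print(f"sharp: {sharp_ratio:.3f}, smoothed: {smooth_ratio:.3f}")
```

Under these assumptions, the blurred array yields a lower ratio than the detailed one. Tracking such a ratio on predicted versus ground-truth spectrograms over training steps is one way a cepstral measure could expose over-averaging; the same transform applied along the time axis would probe temporal rather than spectral smoothing.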
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
