More Data, Fewer Diacritics: Scaling Arabic TTS

Ahmed Musleh; Yifan Zhang; Kareem Darwish

arXiv:2603.01622 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper demonstrates that large-scale automatically annotated Arabic speech data can effectively train TTS models, reducing reliance on diacritics and enabling scalable Arabic speech synthesis.

Contribution

It introduces a pipeline for collecting and processing large Arabic speech datasets and shows that data scale can offset the need for diacritization in TTS training.

Findings

01 Larger datasets improve TTS quality even without diacritics.

02 Models trained on 4,000 hours of data outperform models trained on 100 or 1,000 hours.

03 Diacritized data generally yields better results, but larger training sets compensate for much of that advantage.

Abstract

Arabic Text-to-Speech (TTS) research has been hindered by the limited availability of both public training data and accurate Arabic diacritization models. In this paper, we address these limitations by exploring Arabic TTS training on large automatically annotated data. Specifically, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning on varying amounts of data (100, 1,000, and 4,000 hours), with and without diacritization. We show that although models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public…
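The "with and without diacritization" comparison implies two versions of each transcript. One way to derive the undiacritized version from a diacritized one is sketched below. This is an illustrative assumption, not the authors' code: the function name `strip_diacritics` is hypothetical, and the set of marks removed (U+064B through U+0652 plus the superscript alef U+0670) is one common definition of Arabic diacritics that may differ from the one used in the paper.

```python
import re

# Assumed definition of "diacritics": the Arabic marks fathatan through sukun
# (U+064B-U+0652) plus superscript alef (U+0670). The paper's exact set is
# not stated in the abstract.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with Arabic diacritic marks removed; other characters are kept."""
    return _DIACRITICS.sub("", text)


# Example: kaf + fatha, ta + fatha, ba + fatha ("kataba") loses its three fathas.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
undiacritized = strip_diacritics(diacritized)
```

Applying a function like this to the diacritized transcripts would give paired training sets that differ only in the presence of diacritics, which is the kind of contrast the abstract describes.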

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and dialogue systems