A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture
Fady Fahmy, Mahmoud Khalil, Hazem Abbas

TL;DR
This paper presents a novel end-to-end deep learning architecture for Arabic text-to-speech synthesis that achieves high-quality, natural speech with limited data, leveraging transfer learning and English character embeddings.
Contribution
It introduces a transfer learning-based end-to-end TTS system for Arabic, overcoming data scarcity and improving speech naturalness compared to prior methods.
Findings
High-quality Arabic speech synthesis achieved with only 2.41 hours of data
Use of English character embeddings enhances model performance
Preprocessing techniques improve speech naturalness
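
As a rough illustration of how character embeddings learned on English could be reused for a second language, the sketch below warm-starts a new embedding table from a pretrained one. It is not the authors' implementation. It assumes the Arabic text is first romanized into Latin characters (the summary above does not say how the two character sets are aligned), and the symbol sets, the embedding size, and the `transfer_embeddings` helper are all hypothetical.

```python
import random


def transfer_embeddings(pretrained, source_symbols, target_symbols, dim, seed=0):
    """Build an embedding table for target_symbols.

    Rows for symbols that also exist in source_symbols are copied from the
    pretrained table; rows for unseen symbols get a small random init.
    Returns the new table and the number of rows that were reused.
    """
    rng = random.Random(seed)
    src_index = {sym: i for i, sym in enumerate(source_symbols)}
    table, reused = [], 0
    for sym in target_symbols:
        if sym in src_index:
            # Copy the pretrained vector so later fine-tuning starts from it.
            table.append(list(pretrained[src_index[sym]]))
            reused += 1
        else:
            # Symbol unseen during English training: initialize from scratch.
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table, reused


# Hypothetical "English" symbol set and a stand-in for a pretrained table.
english_symbols = list("abcdefghijklmnopqrstuvwxyz ")
dim = 4  # toy embedding size, chosen only for illustration
_rng = random.Random(42)
pretrained = [[_rng.uniform(-1, 1) for _ in range(dim)] for _ in english_symbols]

# Hypothetical romanized target symbols: some overlap with English, some new.
target_symbols = list("abdhklmnrstwy ") + ["$", "*", "~"]

table, reused = transfer_embeddings(pretrained, english_symbols, target_symbols, dim)
print(f"reused {reused} of {len(target_symbols)} embedding rows")
```

In a real system the copied rows would then be fine-tuned together with the rest of the network on the small target-language corpus. The sketch covers only the initialization step.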
Abstract
Speech synthesis is the artificial production of human speech. A typical text-to-speech system converts written text into a waveform. Many English TTS systems are mature and produce natural, human-like speech. In contrast, other languages, including Arabic, have received attention only recently. Existing Arabic speech synthesis solutions are slow and of low quality, and the naturalness of their synthesized speech is inferior to that of English synthesizers. They also lack key speech factors such as intonation, stress, and rhythm. Various works have been proposed to address these issues, including concatenative methods such as unit selection and parametric methods. However, they require a lot of laborious work and domain expertise. Another reason for the poor performance of Arabic speech synthesizers is the lack of speech corpora, unlike English that…
