# ClArTTS: An Open-Source Classical Arabic Text-to-Speech Corpus

**Authors:** Ajinkya Kulkarni, Atharva Kulkarni, Sara Abedalmonem Mohammad Shatnawi, Hanan Aldarmaki

arXiv: 2303.00069 · 2023-03-02

## TL;DR

This paper introduces ClArTTS, an open-source Classical Arabic speech corpus of about 12 hours from a single male speaker. It is built to support end-to-end TTS development for Arabic, where freely available corpora have been either multi-speaker and inconsistent in quality or too small for state-of-the-art models.

## Contribution

It provides an open, single-speaker Classical Arabic corpus sized for end-to-end TTS training, together with two baseline TTS systems, addressing a resource gap in Arabic speech synthesis research.

## Key findings

- The corpus contains about 12 hours of speech from a single male speaker.
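- Two baseline TTS systems, based on Grad-TTS and Glow-TTS, are trained on the corpus and assessed with subjective and objective evaluations.
- The corpus and a demo of the baseline systems are to be released at www.clartts.com for research purposes.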

## Abstract

At present, text-to-speech (TTS) systems that are trained with high-quality transcribed speech data using end-to-end neural models can generate speech that is intelligible, natural, and closely resembles human speech. These models are trained with relatively large single-speaker professionally recorded audio, typically extracted from audiobooks. Meanwhile, due to the scarcity of freely available speech corpora of this kind, a larger gap exists in Arabic TTS research and development. Most of the existing freely available Arabic speech corpora are not suitable for TTS training as they contain multi-speaker casual speech with variations in recording conditions and quality, whereas the corpora curated for speech synthesis are generally small in size and not suitable for training state-of-the-art end-to-end models. In a move towards filling this gap in resources, we present a speech corpus for Classical Arabic Text-to-Speech (ClArTTS) to support the development of end-to-end TTS systems for Arabic. The speech is extracted from a LibriVox audiobook, which is then processed, segmented, and manually transcribed and annotated. The final ClArTTS corpus contains about 12 hours of speech from a single male speaker sampled at 40,100 Hz (40.1 kHz). In this paper, we describe the process of corpus creation and provide details of corpus statistics and a comparison with existing resources. Furthermore, we develop two TTS systems based on Grad-TTS and Glow-TTS and illustrate the performance of the resulting systems via subjective and objective evaluations. The corpus will be made publicly available at www.clartts.com for research purposes, along with the baseline TTS systems demo.
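To give a rough sense of the scale the abstract describes, the sketch below estimates the sample count and uncompressed size of about 12 hours of audio. It assumes the sampling rate is 40,100 Hz (40.1 kHz) and that the audio is 16-bit mono PCM. The bit depth and channel count are illustrative assumptions that the abstract does not state, and the function name `corpus_size` is hypothetical.

```python
# Back-of-the-envelope corpus size under stated assumptions.
SAMPLE_RATE_HZ = 40_100   # assumption: the abstract's rate read as 40.1 kHz
HOURS = 12                # "about 12 hours" per the abstract
BYTES_PER_SAMPLE = 2      # assumption: 16-bit mono PCM, not stated in the source


def corpus_size(hours, sample_rate_hz, bytes_per_sample):
    """Return (total samples, uncompressed size in GiB) for mono audio."""
    total_samples = int(hours * 3600 * sample_rate_hz)
    size_gib = total_samples * bytes_per_sample / 2**30
    return total_samples, size_gib


samples, gib = corpus_size(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE)
print(f"{samples:,} samples, about {gib:.2f} GiB uncompressed")
```

Because the duration is only approximate, the result is an order-of-magnitude estimate and not a statistic reported by the authors.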

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2303.00069/full.md

## Figures

1 figure with captions in the complete paper: https://tomesphere.com/paper/2303.00069/full.md

## References

25 references — full list in the complete paper: https://tomesphere.com/paper/2303.00069/full.md

---
Source: https://tomesphere.com/paper/2303.00069