ManaTTS Persian: a recipe for creating TTS datasets for lower resource languages
Mahta Fetrat Qharabagh, Zahra Dehghanian, Hamid R. Rabiee

TL;DR
This paper introduces ManaTTS, a large single-speaker Persian speech dataset, together with an open pipeline for building transcribed speech datasets and a companion dataset for evaluating the speech recognition models used in forced alignment.
Contribution
It provides the largest publicly accessible single-speaker Persian speech corpus and a transparent pipeline with tools for sentence tokenization, bounded audio segmentation, and forced alignment tailored to low-resource languages.
Findings
Achieved a MOS of 3.76 with the TTS model, close to natural speech quality.
Developed a fully open, MIT-licensed pipeline for dataset creation and alignment.
Created the VirgoolInformal dataset (over 5 hours of audio) to evaluate the Persian speech recognition models used for forced alignment.
Abstract
In this study, we introduce ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language. ManaTTS, released under the open CC-0 license, comprises approximately 86 hours of audio with a sampling rate of 44.1 kHz. Alongside ManaTTS, we also generated the VirgoolInformal dataset to evaluate Persian speech recognition models used for forced alignment, extending over 5 hours of audio. The datasets are supported by a fully transparent, MIT-licensed pipeline, a testament to innovation in the field. It includes unique tools for sentence tokenization, bounded audio segmentation, and a novel forced alignment method. This alignment technique is specifically designed for low-resource languages, addressing a crucial need in the field. With this dataset, we trained a Tacotron2-based…
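The abstract mentions bounded audio segmentation without describing it in this excerpt. As a rough illustration only, and not the authors' implementation, the sketch below shows one simple way to keep segments under a maximum duration: it greedily cuts at the latest candidate boundary (for example, a detected silence) that still fits, and falls back to a hard cut when no boundary is available. The function name `bounded_segments` and its parameters are hypothetical.

```python
from typing import List, Tuple


def bounded_segments(
    cut_points: List[float], total: float, max_len: float
) -> List[Tuple[float, float]]:
    """Greedily split [0, total] into segments no longer than max_len seconds.

    cut_points are candidate boundaries in seconds (assumed here to be
    midpoints of detected silences). This is an illustrative heuristic,
    not the method used in the ManaTTS pipeline.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")

    # Keep only boundaries strictly inside the recording, in time order.
    cuts = sorted(c for c in cut_points if 0 < c < total)

    segments: List[Tuple[float, float]] = []
    start = 0.0
    i = 0
    while total - start > max_len:
        limit = start + max_len
        best = None
        # Take the latest candidate boundary that keeps the segment in bounds.
        while i < len(cuts) and cuts[i] <= limit:
            if cuts[i] > start:
                best = cuts[i]
            i += 1
        # With no usable boundary, fall back to a hard cut at the limit.
        end = best if best is not None else limit
        segments.append((start, end))
        start = end
    # Whatever remains already fits within max_len.
    segments.append((start, total))
    return segments
```

A real pipeline would more likely derive the candidate boundaries from a voice activity detector and might also enforce a minimum segment length; both are left out here to keep the sketch short.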
Taxonomy
Topics: Natural Language Processing Techniques
