TL;DR
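This paper evaluates four open-source TTS architectures for building new models when language data or compute is limited, using Romanian as the test case. It highlights challenges in setup, data preparation, and computational efficiency to guide inclusive, practical speech synthesis development.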
Contribution
It provides a systematic assessment of four widely adopted open-source TTS architectures (FastPitch, VITS, Grad-TTS, and Matcha-TTS), focusing on real-world feasibility and reproducibility for under-resourced languages and constrained hardware.
Findings
Setup and data preprocessing pose significant challenges.
Computational efficiency issues hinder adoption in low-resource settings.
Evaluation combines objective metrics with subjective listening tests covering intelligibility, speaker similarity, and naturalness.
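The summary does not say which objective metrics or listening-test scales were used. As a minimal sketch, assuming word error rate (WER) on transcripts of the synthesized audio as an intelligibility measure and a mean opinion score (MOS) for the listening tests, both common choices in TTS evaluation but not confirmed by the source, the two quantities could be computed as follows. The function names `word_error_rate` and `mos_summary` are hypothetical, and the confidence interval uses a simple normal approximation.

```python
import math
import statistics


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: `hypothesis` is a transcript of the synthesized audio
    (for example from an ASR system); the source does not specify this.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def mos_summary(ratings: list[float]) -> tuple[float, float]:
    """Return (mean, half-width of an approximate 95% confidence interval).

    Assumption: ratings are on a numeric scale such as 1 to 5, and a
    normal approximation (1.96 * standard error) is acceptable.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
    print(mos_summary([4, 5, 3, 4]))
```

Lower WER indicates more intelligible speech, while MOS is reported as a mean with its interval so that systems can be compared with an indication of rater variability.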
Abstract
Open-source text-to-speech (TTS) frameworks have emerged as highly adaptable platforms for developing speech synthesis systems across a wide range of languages. However, their applicability is not uniform -- particularly when the target language is under-resourced or when computational resources are constrained. In this study, we systematically assess the feasibility of building novel TTS models using four widely adopted open-source architectures: FastPitch, VITS, Grad-TTS, and Matcha-TTS. Our evaluation spans multiple dimensions, including qualitative aspects such as ease of installation, dataset preparation, and hardware requirements, as well as quantitative assessments of synthesis quality for Romanian. We employ both objective metrics and subjective listening tests to evaluate intelligibility, speaker similarity, and naturalness of the generated speech. The results reveal…
