An objective evaluation of the effects of recording conditions and speaker characteristics in multi-speaker deep neural speech synthesis
Beata Lorincz, Adriana Stan, Mircea Giurgiu

TL;DR
This study systematically evaluates how recording conditions, speaker gender, and text representation affect the quality of multi-speaker neural TTS, revealing significant impacts of recording conditions and variable effects of text features.
Contribution
The paper provides an objective analysis of the factors that influence multi-speaker TTS quality, using a large Romanian corpus and multiple evaluation metrics, and highlights the importance of recording conditions.
Findings
Recording conditions significantly affect synthetic quality.
Speaker gender does not influence output quality.
Extended text features do not uniformly improve synthesis.
Abstract
Multi-speaker spoken datasets enable the creation of text-to-speech synthesis (TTS) systems that can output several voice identities. The multi-speaker (MSPK) scenario also enables the use of fewer training samples per speaker. However, in the resulting acoustic model, not all speakers exhibit the same synthetic quality, and some of the voice identities cannot be used at all. In this paper we evaluate the influence of the recording conditions, speaker gender, and speaker particularities on the quality of the synthesised output of a deep neural TTS architecture, namely Tacotron2. The evaluation is possible due to the use of a large Romanian parallel spoken corpus containing over 81 hours of data. Within this setup, we also evaluate the influence of different types of text representations: orthographic, phonetic, and phonetic extended with syllable boundaries and lexical stress…
