An analysis on the effects of speaker embedding choice in non auto-regressive TTS

Adriana Stan; Johannah O'Mahony

arXiv:2307.09898·eess.AS·July 20, 2023

Open Access

TL;DR

This study investigates how different speaker embedding choices affect non-autoregressive multi-speaker TTS systems, revealing that embedding set and initialization have minimal impact on speech quality but influence speaker identity representation.

Contribution

The paper offers a first attempt at analysing how speaker embedding selection and training strategy influence non-autoregressive TTS performance and speaker identity leakage.

Findings

01 Speaker identity is well preserved regardless of embedding set.

02 Speaker leakage occurs in the core speech abstraction, regardless of training strategy.

03 Embedding choice has minimal effect on speech quality.

Abstract

In this paper we introduce a first attempt at understanding how a non-autoregressive factorised multi-speaker speech synthesis architecture exploits the information present in different speaker embedding sets. We analyse whether jointly learning the representations and initialising them from pretrained models yield any quality improvements for target speaker identities. In a separate analysis, we investigate how the different sets of embeddings impact the network's core speech abstraction (i.e. zero-conditioned) in terms of speaker identity and representation learning. We show that, regardless of the set of embeddings and learning strategy used, the network can handle various speaker identities equally well, with barely noticeable variations in speech output quality, and that speaker leakage within the core structure of the synthesis system is inevitable in the standard training…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing