An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS
Marie Kunešová, Zdeněk Hanzlíček, Jindřich Matoušek

TL;DR
This study compares three speaker encoders within a zero-shot multi-speaker TTS system and finds that YourTTS's original H/ASP encoder outperforms both ECAPA-TDNN and x-vectors in speaker similarity, which underscores the need to validate speaker embeddings empirically for TTS.
Contribution
The paper provides a systematic comparison of speaker encoders in zero-shot TTS, demonstrating that embeddings popular in speaker recognition are not necessarily optimal for TTS applications.
Findings
H/ASP encoder outperforms ECAPA-TDNN and x-vectors in speaker similarity
ECAPA-TDNN performs better than x-vectors but worse than H/ASP
Empirical evaluation highlights the importance of task-specific testing for speaker embeddings
Abstract
Zero-shot multi-speaker text-to-speech (TTS) systems rely on speaker embeddings to synthesize speech in the voice of an unseen speaker, using only a short reference utterance. While many speaker embeddings have been developed for speaker recognition, their relative effectiveness in zero-shot TTS remains underexplored. In this work, we employ a YourTTS-based TTS system to compare three different speaker encoders - YourTTS's original H/ASP encoder, x-vector embeddings, and ECAPA-TDNN embeddings - within an otherwise fixed zero-shot TTS framework. All models were trained on the same dataset of Czech read speech and evaluated on 24 out-of-domain target speakers using both subjective and objective methods. The subjective evaluation was conducted via a listening test focused on speaker similarity, while the objective evaluation measured cosine distances between speaker embeddings extracted…
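The objective metric named in the abstract, cosine distance between speaker embeddings, can be illustrated with a minimal sketch. The abstract is truncated before it states which utterances the embeddings were extracted from, so the pairing of a reference embedding with a synthesized-speech embedding below is an assumption, as are the function name, variable names, and toy vectors. Real embeddings would come from a speaker encoder and have far more dimensions.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings
# (hypothetical values, not data from the paper).
reference_embedding = [1.0, 0.0, 0.0]
synthesized_embedding = [0.0, 1.0, 0.0]
distance = cosine_distance(reference_embedding, synthesized_embedding)
```

With this definition, a distance of 0 means the two embeddings point in the same direction, and larger values mean the embeddings are less similar.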
