An Exploration of ECAPA-TDNN and x-vector Speaker Representations in Zero-shot Multi-speaker TTS

Marie Kunešová; Zdeněk Hanzlíček; Jindřich Matoušek

arXiv:2506.20190·eess.AS·September 1, 2025

TL;DR

This study compares speaker embedding methods in zero-shot multi-speaker TTS and finds that YourTTS's original H/ASP encoder outperforms ECAPA-TDNN and x-vector embeddings in speaker similarity, which underscores the need to validate speaker embeddings empirically for TTS.

Contribution

The paper provides a systematic comparison of speaker encoders in zero-shot TTS, demonstrating that popular speaker recognition embeddings are not always optimal for TTS applications.

Findings

01. H/ASP encoder outperforms ECAPA-TDNN and x-vectors in speaker similarity

02. ECAPA-TDNN performs better than x-vectors but worse than H/ASP

03. Empirical evaluation highlights the importance of task-specific testing for speaker embeddings

Abstract

Zero-shot multi-speaker text-to-speech (TTS) systems rely on speaker embeddings to synthesize speech in the voice of an unseen speaker, using only a short reference utterance. While many speaker embeddings have been developed for speaker recognition, their relative effectiveness in zero-shot TTS remains underexplored. In this work, we employ a YourTTS-based TTS system to compare three different speaker encoders - YourTTS's original H/ASP encoder, x-vector embeddings, and ECAPA-TDNN embeddings - within an otherwise fixed zero-shot TTS framework. All models were trained on the same dataset of Czech read speech and evaluated on 24 out-of-domain target speakers using both subjective and objective methods. The subjective evaluation was conducted via a listening test focused on speaker similarity, while the objective evaluation measured cosine distances between speaker embeddings extracted…
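The abstract names cosine distance between speaker embeddings as the objective measure. As a rough illustration only, the sketch below computes that distance for plain Python vectors. The function name `cosine_distance` and the toy 3-dimensional vectors are assumptions made for this example; real speaker embeddings have far more dimensions, and the text here does not state which utterances or which encoder the authors used for the comparison.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for real speaker embeddings.
reference = [1.0, 0.0, 0.0]    # e.g. embedding of a reference utterance
synthesized = [1.0, 1.0, 0.0]  # e.g. embedding of synthesized speech
unrelated = [0.0, 0.0, 1.0]    # orthogonal to the reference

print(cosine_distance(reference, synthesized))
print(cosine_distance(reference, unrelated))
```

A smaller distance means the two embeddings point in more similar directions. Under this measure, identical directions give a distance of 0 and orthogonal vectors give 1.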
