Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis

Yixuan Zhou; Changhe Song; Xiang Li; Luwen Zhang; Zhiyong Wu; Yanyao Bian; Dan Su; Helen Meng

arXiv:2204.00990·cs.SD·November 14, 2022

Open Access

TL;DR

This paper introduces a content-dependent fine-grained speaker embedding method for zero-shot text-to-speech adaptation, significantly improving speaker similarity by capturing personal pronunciation nuances.

Contribution

It proposes a novel local content embedding approach with a reference attention module to better model individual pronunciation characteristics in zero-shot TTS.

Findings

01. Enhanced speaker similarity in synthesized speech for unseen speakers.

02. Effective modeling of personal pronunciation traits improves TTS quality.

03. Outperforms previous global speaker embedding methods.

Abstract

Zero-shot speaker adaptation aims to clone an unseen speaker's voice without any adaptation time or parameters. Previous studies usually use a speaker encoder to extract a global, fixed speaker embedding from reference speech, and several have tried variable-length speaker embeddings. However, they neglect to transfer the personal pronunciation characteristics related to phoneme content, leading to poor speaker similarity in terms of detailed speaking styles and pronunciation habits. To improve the ability of the speaker encoder to model personal pronunciation characteristics, we propose content-dependent fine-grained speaker embedding for zero-shot speaker adaptation. The corresponding local content embeddings and speaker embeddings are each extracted from a reference speech. Instead of modeling the temporal relations, a reference attention module is introduced to…
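The abstract is cut off before it says what the reference attention module computes, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes plain scaled dot-product attention in which text-side phoneme features act as queries, the local content embeddings of the reference speech act as keys, and the local speaker embeddings act as values. The function names, vector sizes, and toy numbers are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reference_attention(queries, keys, values):
    """Scaled dot-product attention over reference segments.

    ASSUMED roles (not confirmed by the truncated abstract):
      queries: one content vector per input phoneme (text side)
      keys:    one local content embedding per reference segment
      values:  one local speaker embedding per reference segment
    Returns one attended speaker vector per input phoneme.
    """
    dim = len(keys[0])
    value_dim = len(values[0])
    outputs = []
    for q in queries:
        # Content similarity between this phoneme and each reference segment.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the reference speaker embeddings.
        attended = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(value_dim)
        ]
        outputs.append(attended)
    return outputs


# Toy data: two reference segments, three input phonemes (hypothetical values).
ref_content = [[1.0, 0.0], [0.0, 1.0]]
ref_speaker = [[0.2, 0.8, 0.0], [0.9, 0.1, 0.5]]
text_content = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]

fine_grained = reference_attention(text_content, ref_content, ref_speaker)
```

In this toy setup each input phoneme receives its own speaker vector, weighted toward the reference segments whose content it most resembles. That per-phoneme weighting is the sense in which the embedding would be content-dependent rather than a single global vector.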

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing