Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations

Wenbin Wang; Yang Song; Sanjay Jha

arXiv:2308.13007 · cs.SD · August 28, 2023

Open Access

TL;DR

This paper introduces GZS-TV, a novel zero-shot speaker adaptive speech synthesis model that uses disentangled representations to improve generalization and synthesis quality for unseen speakers.

Contribution

The paper presents GZS-TV, a new model employing disentangled representation learning and variational autoencoders to enhance zero-shot speaker adaptation in speech synthesis.

Findings

01 GZS-TV reduces performance degradation on unseen speakers.

02 GZS-TV outperforms baseline models across multiple datasets.

03 Disentangled representations improve model generalization.

Abstract

While most research into speech synthesis has focused on synthesizing high-quality speech for in-dataset speakers, an equally essential yet unsolved problem is synthesizing speech for unseen, out-of-dataset speakers with limited reference data, i.e., speaker adaptive speech synthesis. Many studies have proposed zero-shot speaker adaptive text-to-speech and voice conversion approaches aimed at this task. However, most current approaches suffer from degraded naturalness and speaker similarity when synthesizing speech for unseen speakers (i.e., speakers not in the training dataset) because the model generalizes poorly to out-of-distribution data. To address this problem, we propose GZS-TV, a generalizable zero-shot speaker adaptive text-to-speech and voice conversion model. GZS-TV introduces disentangled representation learning for both speaker embedding…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing