Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
Wenbin Wang, Yang Song, Sanjay Jha

TL;DR
This paper introduces GZS-TV, a novel zero-shot speaker adaptive speech synthesis model that uses disentangled representations to improve generalization and synthesis quality for unseen speakers.
Contribution
The paper presents GZS-TV, a new model employing disentangled representation learning and variational autoencoders to enhance zero-shot speaker adaptation in speech synthesis.
Findings
GZS-TV reduces performance degradation on unseen speakers.
GZS-TV outperforms baseline models across multiple datasets.
Disentangled representations improve model generalization.
Abstract
While most research into speech synthesis has focused on synthesizing high-quality speech for in-dataset speakers, an equally essential yet unsolved problem is synthesizing speech for unseen, out-of-dataset speakers from limited reference data, i.e., speaker adaptive speech synthesis. Many studies have proposed zero-shot speaker adaptive text-to-speech and voice conversion approaches aimed at this task. However, most current approaches suffer from degraded naturalness and speaker similarity when synthesizing speech for unseen speakers (i.e., speakers not in the training dataset) because the model generalizes poorly to out-of-distribution data. To address this problem, we propose GZS-TV, a generalizable zero-shot speaker adaptive text-to-speech and voice conversion model. GZS-TV introduces disentangled representation learning for both speaker embedding…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
