Towards Improved Zero-shot Voice Conversion with Conditional DSVAE
Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, and Dong Yu

TL;DR
This paper introduces a conditional DSVAE model that improves zero-shot voice conversion by better preserving phonetic information in content embeddings, leading to enhanced performance over previous DSVAE approaches.
Contribution
The study proposes a conditional DSVAE that conditions the content embedding on a phonetic bias, improving phonetic preservation and zero-shot VC performance compared to the baseline DSVAE.
Findings
Conditional DSVAE achieves higher phoneme classification accuracy.
Content embeddings are more stable and phonetically richer.
Zero-shot VC performance is significantly improved.
Abstract
Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition. We demonstrated that simultaneously disentangling the content embedding and the speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue in this direction by raising a concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior distribution forces the content embedding to lose phonetic-structure information during the learning process, which is not a desired property. Here, we seek a better content embedding with more phonetic information preserved. We propose conditional DSVAE, a new model that enables content…
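The concern about the content prior can be illustrated with a toy calculation. The sketch below is not the paper's architecture: the dimensions, the random `phonetic_bias` array, and the choice of diagonal Gaussians with unit prior variance are all assumptions made for illustration. It only shows that when a content posterior carries frame-dependent structure, the KL term against an input-independent prior is larger than against a prior whose mean is conditioned on that structure, so the unconditioned prior penalizes exactly the information one would like to keep.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the last axis."""
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

rng = np.random.default_rng(0)
T, D = 50, 8  # frames and content dimension (hypothetical sizes)

# Hypothetical stand-in for per-frame phonetic structure.
phonetic_bias = rng.normal(size=(T, D))

# Toy content posterior whose mean tracks the phonetic structure.
mu_q = phonetic_bias + 0.1 * rng.normal(size=(T, D))
logvar_q = np.full((T, D), -2.0)

# Prior A: independent of the input (zero mean, unit variance).
mu_p_uncond = np.zeros((T, D))
# Prior B: mean conditioned on the phonetic structure (assumption: identity map).
mu_p_cond = phonetic_bias
logvar_p = np.zeros((T, D))

kl_uncond = kl_diag_gaussians(mu_q, logvar_q, mu_p_uncond, logvar_p).mean()
kl_cond = kl_diag_gaussians(mu_q, logvar_q, mu_p_cond, logvar_p).mean()

print(f"mean KL per frame, unconditioned prior: {kl_uncond:.3f}")
print(f"mean KL per frame, conditioned prior:   {kl_cond:.3f}")
```

In this toy setting the conditioned prior yields the smaller KL term, so minimizing it does not push the posterior mean away from the frame-dependent structure.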
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
