Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Jiachen Lian, Chunlei Zhang, Gopala Krishna Anumanchipalli, and Dong Yu

arXiv:2205.05227 · eess.AS · June 22, 2022

TL;DR

This paper introduces a conditional DSVAE model that improves zero-shot voice conversion by better preserving phonetic information in content embeddings, leading to enhanced performance over previous DSVAE approaches.

Contribution

The study proposes a conditional DSVAE that conditions the content embedding on a phonetic bias, which improves phonetic preservation and zero-shot VC performance over the baseline DSVAE.

Findings

01

Conditional DSVAE achieves higher phoneme classification accuracy than the baseline DSVAE.

02

Content embeddings are more stable and phonetically richer.

03

Zero-shot VC performance is significantly improved.

Abstract

Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition. We demonstrated that simultaneously disentangling the content embedding and the speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue in this direction by raising one concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior distribution forces the content embedding to lose phonetic-structure information during the learning process, which is not a desired property. Here, we seek a better content embedding that preserves more phonetic information. We propose conditional DSVAE, a new model that enables content…
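
The concern about the content prior can be illustrated with a toy calculation. The sketch below is not the paper's architecture: it assumes diagonal Gaussian distributions, uses a standard normal as a stand-in for an uninformative prior, and uses a made-up per-frame `bias` vector as a stand-in for a phonetic condition. All names and numbers are hypothetical. It only shows that the KL term penalizes frame-specific structure in the posterior mean less when the prior mean is conditioned on a related signal.

```python
import math


def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(q || p) for two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p):
        total += 0.5 * (math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0)
    return total


# Hypothetical posterior over one frame's content embedding (4 dimensions).
mu_q = [1.2, -0.8, 0.5, 2.0]
var_q = [0.3, 0.3, 0.3, 0.3]

# Stand-in for an uninformative prior: zero mean, unit variance (assumption).
kl_uncond = kl_diag_gaussians(mu_q, var_q, [0.0] * 4, [1.0] * 4)

# Stand-in for a conditioned prior: the mean follows a made-up per-frame bias.
bias = [1.0, -1.0, 0.5, 1.5]
kl_cond = kl_diag_gaussians(mu_q, var_q, bias, [1.0] * 4)

print(f"KL vs. unconditioned prior: {kl_uncond:.4f}")
print(f"KL vs. conditioned prior:   {kl_cond:.4f}")
```

In this toy setting the conditioned prior gives the smaller KL value, so minimizing the KL term puts less pressure on the posterior to discard its frame-specific mean. Whether and how this carries over to the actual model depends on details the truncated abstract does not give.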

Code & Models

Repositories

jlian2/Improved-Voice-Conversion-with-Conditional-DSVAE

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing