Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion
Jiachen Lian, Chunlei Zhang, Dong Yu

TL;DR
This paper introduces a self-supervised disentangled speech representation learning method for zero-shot voice conversion, achieving state-of-the-art results and robustness against noise without relying on parallel data or known speakers.
Contribution
The paper proposes a disentanglement approach for zero-shot VC that combines a sequential VAE with an on-the-fly data augmentation strategy, and it requires neither parallel training data nor known speakers.
Findings
State-of-the-art performance on TIMIT and VCTK datasets.
Robust voice conversion with noisy source/target utterances.
Effective disentanglement of speaker and content representations.
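The noise robustness listed above is attributed to data augmentation applied during training. The summary does not say which noise sources or levels are used, so the following is only a minimal sketch of one common way to do it: mixing a noise signal into a clean waveform at a randomly drawn signal-to-noise ratio. The function names `mix_at_snr` and `augment`, the Gaussian noise, and the 5 to 20 dB range are all illustrative assumptions, not details from the paper.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaled so the mixture has the requested SNR (dB)."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # Solve p_clean / (scale**2 * p_noise) = 10 ** (snr_db / 10) for scale.
    scale = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, noise)]


def augment(clean, rng, snr_range=(5.0, 20.0)):
    """Draw fresh noise and a fresh SNR on every call (hypothetical settings)."""
    noise = [rng.gauss(0.0, 1.0) for _ in clean]  # assumption: white Gaussian noise
    snr_db = rng.uniform(*snr_range)              # assumption: 5 to 20 dB
    return mix_at_snr(clean, noise, snr_db), snr_db


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [math.sin(0.1 * i) for i in range(1000)]
    noisy, snr_db = augment(clean, rng)
    print(f"mixed {len(noisy)} samples at {snr_db:.1f} dB SNR")
```

Calling `augment` inside the data loader, rather than precomputing noisy copies, means each training step can see a different corruption of the same utterance.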
Abstract
Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between global speaker representation and time-varying content representation in a sequential variational autoencoder (VAE). Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. In addition, an on-the-fly data augmentation training strategy is applied to make the learned representation noise invariant. On TIMIT and VCTK datasets, we achieve state-of-the-art performance on both…
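The conversion step described in the abstract, pairing one utterance's time-varying content embeddings with another utterance's global speaker embedding and decoding the result, can be illustrated with a toy sketch. The paper uses learned neural encoders and a decoder inside a sequential VAE; the code below replaces them with hand-written stand-ins (a per-dimension mean over time as the "speaker" code, per-frame residuals as the "content" code) purely to show the data flow. All function names here are hypothetical.

```python
def encode_speaker(frames):
    """Toy global (time-invariant) code: mean over time for each feature dimension."""
    n_frames = len(frames)
    n_dims = len(frames[0])
    return [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]


def encode_content(frames):
    """Toy time-varying code: each frame minus the utterance-level mean."""
    speaker = encode_speaker(frames)
    return [[x - m for x, m in zip(frame, speaker)] for frame in frames]


def decode(speaker, content):
    """Toy decoder: recombine one global code with a sequence of content codes."""
    return [[c + s for c, s in zip(frame, speaker)] for frame in content]


def convert(source_frames, target_frames):
    """Keep the source's content sequence, swap in the target's speaker code."""
    return decode(encode_speaker(target_frames), encode_content(source_frames))


if __name__ == "__main__":
    source = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 frames, 2 features
    target = [[10.0, 0.0], [12.0, 2.0]]            # 2 frames, 2 features
    for frame in convert(source, target):
        print(frame)
```

The output has as many frames as the source, while its utterance-level statistics follow the target. Because the target utterance only contributes a single global code, it can come from a speaker never seen before, which is what makes the setting zero-shot.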
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
