Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Jiachen Lian; Chunlei Zhang; Dong Yu

arXiv:2203.16705·eess.AS·April 1, 2022

TL;DR

This paper introduces a self-supervised disentangled speech representation learning method for zero-shot voice conversion, achieving state-of-the-art results and robustness against noise without relying on parallel data or known speakers.

Contribution

The paper proposes disentangling speaker and content representations with a sequential VAE, combined with an on-the-fly data augmentation strategy, to perform zero-shot VC without parallel training data.

Findings

01 State-of-the-art performance on TIMIT and VCTK datasets.
02 Robust voice conversion with noisy source/target utterances.
03 Effective disentanglement of speaker and content representations.

Abstract

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between global speaker representation and time-varying content representation in a sequential variational autoencoder (VAE). A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. Besides that, an on-the-fly data augmentation training strategy is applied to make the learned representation noise invariant. On TIMIT and VCTK datasets, we achieve state-of-the-art performance on both…
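
The conversion step described in the abstract (one global speaker embedding plus per-frame content embeddings fed to a decoder) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the random linear maps stand in for trained encoder and decoder networks, the dimensions and the names `encode_speaker`, `encode_content`, `decode`, and `convert` are hypothetical, and the variational sampling and KL terms of the actual sequential VAE are omitted. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration; not taken from the paper.
N_MELS, D_SPK, D_CONTENT = 80, 16, 8

# Toy random linear maps standing in for trained networks.
W_spk = rng.standard_normal((N_MELS, D_SPK)) * 0.1
W_content = rng.standard_normal((N_MELS, D_CONTENT)) * 0.1
W_dec = rng.standard_normal((D_SPK + D_CONTENT, N_MELS)) * 0.1


def encode_speaker(mel):
    """Global speaker embedding: one vector per utterance (mean over time)."""
    return (mel @ W_spk).mean(axis=0)


def encode_content(mel):
    """Time-varying content embeddings: one vector per frame."""
    return mel @ W_content


def decode(speaker_emb, content_embs):
    """Repeat the speaker vector for every frame, then decode frame-wise."""
    n_frames = content_embs.shape[0]
    spk = np.tile(speaker_emb, (n_frames, 1))
    return np.concatenate([spk, content_embs], axis=1) @ W_dec


def convert(source_mel, target_mel):
    """Keep the source content, swap in the target speaker embedding."""
    return decode(encode_speaker(target_mel), encode_content(source_mel))


source = rng.standard_normal((120, N_MELS))  # 120 frames of source speech
target = rng.standard_normal((95, N_MELS))   # 95 frames from an unseen speaker
converted = convert(source, target)          # same length as the source
```

The converted output keeps the frame count of the source utterance because only the content embeddings are time-varying; the target utterance contributes a single vector regardless of its length, which is what allows an arbitrary, previously unseen speaker to be used at inference time.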

Code & Models

Repositories

jlian2/Robust-Voice-Style-Transfer

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing