GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus

Zining Zhang; Bingsheng He; Zhenjie Zhang

arXiv:2010.12788 · cs.SD · October 27, 2020 · 1 citation

TL;DR

GAZEV is a GAN-based zero-shot voice conversion system that effectively converts speech between unseen speakers without requiring parallel data, improving speech quality and maintaining speaker similarity.

Contribution

Introduces GAZEV, a novel GAN-based zero-shot voice conversion method using speaker embedding loss and adaptive normalization, enabling conversion for unseen speakers without parallel data.

Findings

1. Significant improvement in speech quality over existing methods
2. Comparable speaker similarity to AUTOVC on unseen speakers
3. Effective zero-shot conversion without parallel corpora

Abstract

Non-parallel many-to-many voice conversion has recently attracted substantial research effort in the speech processing community. A voice conversion system transforms an utterance of a source speaker into an utterance of a target speaker by keeping the content of the original utterance and replacing its vocal features with those of the target speaker. Existing solutions, e.g., StarGAN-VC2, present promising results only when a speech corpus of the engaged speakers is available during model training. AUTOVC is able to perform voice conversion on unseen speakers, but it needs an external pretrained speaker verification model. In this paper, we present our new GAN-based zero-shot voice conversion solution, called GAZEV, which aims to support unseen speakers in both source and target utterances. Our key technical contribution is the adoption of speaker embedding loss on top of the GAN framework,…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing