GAZEV: GAN-Based Zero-Shot Voice Conversion over Non-parallel Speech Corpus
Zining Zhang, Bingsheng He, Zhenjie Zhang

TL;DR
GAZEV is a GAN-based zero-shot voice conversion system that converts speech between speakers unseen during training without requiring parallel data, improving speech quality over existing methods while keeping speaker similarity comparable to AUTOVC.
Contribution
Introduces GAZEV, a novel GAN-based zero-shot voice conversion method using speaker embedding loss and adaptive normalization, enabling conversion for unseen speakers without parallel data.
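The two ingredients named in the contribution, adaptive normalization and a speaker embedding loss, can be illustrated with a minimal sketch. This summary does not give GAZEV's exact formulation, so the code assumes one common form of each: adaptive instance normalization (each channel is normalized over time, then rescaled by a speaker-conditioned scale and shift) and a cosine-distance loss between speaker embeddings. All function and parameter names are hypothetical and not taken from the paper.

```python
import math

def adaptive_instance_norm(features, gamma, beta, eps=1e-5):
    """Assumed form: normalize each channel over time, then apply
    a speaker-conditioned scale (gamma) and shift (beta) per channel."""
    out = []
    for channel, g, b in zip(features, gamma, beta):
        mean = sum(channel) / len(channel)
        var = sum((x - mean) ** 2 for x in channel) / len(channel)
        std = math.sqrt(var + eps)
        out.append([g * (x - mean) / std + b for x in channel])
    return out

def speaker_embedding_loss(converted_emb, target_emb):
    """Assumed form: cosine distance between the embedding of the
    converted utterance and the target speaker's embedding."""
    dot = sum(a * b for a, b in zip(converted_emb, target_emb))
    norm_c = math.sqrt(sum(a * a for a in converted_emb))
    norm_t = math.sqrt(sum(b * b for b in target_emb))
    return 1.0 - dot / (norm_c * norm_t)

# Toy input: 2 channels x 3 frames; gamma/beta stand in for values
# that a network would predict from the target speaker embedding.
features = [[1.0, 2.0, 3.0], [0.0, 0.0, 4.0]]
gamma = [2.0, 1.0]
beta = [0.5, -1.0]
normalized = adaptive_instance_norm(features, gamma, beta)
```

After normalization each channel's mean equals its `beta` and its standard deviation is approximately its `gamma`, which is how the target speaker's statistics are imposed on the content features in this kind of layer. The loss is zero when the two embeddings point in the same direction and grows as they diverge.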
Findings
Significant improvement in speech quality over existing methods
Comparable speaker similarity to AUTOVC on unseen speakers
Effective zero-shot conversion without parallel corpora
Abstract
Non-parallel many-to-many voice conversion is recently attracting huge research efforts in the speech processing community. A voice conversion system transforms an utterance of a source speaker to another utterance of a target speaker by keeping the content in the original utterance and replacing by the vocal features from the target speaker. Existing solutions, e.g., StarGAN-VC2, present promising results, only when speech corpus of the engaged speakers is available during model training. AUTOVC is able to perform voice conversion on unseen speakers, but it needs an external pretrained speaker verification model. In this paper, we present our new GAN-based zero-shot voice conversion solution, called GAZEV, which targets to support unseen speakers on both source and target utterances. Our key technical contribution is the adoption of speaker embedding loss on top of the GAN framework,…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
