StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts
Matthew Baas, Herman Kamper

TL;DR
This paper introduces StarGAN-ZSVC, a GAN-based model for real-time, zero-shot voice conversion. It is trained without parallel data, needs only a small amount of training data, and converts between speakers that were unseen during training.
Contribution
The paper extends StarGAN-VC by conditioning it on speaker embeddings, which enables zero-shot conversion, and demonstrates real-time performance with limited training data.
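To make the idea of embedding-based conditioning concrete, here is a minimal sketch, not the authors' implementation. It contrasts a one-hot speaker code, which is only defined for training speakers, with a continuous speaker vector that can be computed for any speaker. The function names, the averaging of per-utterance embeddings, and the frame-wise concatenation are all illustrative assumptions; the speaker encoder that would produce the embeddings is not shown.

```python
from typing import List, Sequence


def one_hot_code(speaker_index: int, num_training_speakers: int) -> List[float]:
    """One-hot speaker code: only defined for speakers seen in training."""
    if not 0 <= speaker_index < num_training_speakers:
        raise ValueError("speaker is not in the training set")
    return [1.0 if i == speaker_index else 0.0
            for i in range(num_training_speakers)]


def mean_embedding(utterance_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-utterance embeddings into one vector for a speaker.

    Assumption: a separate speaker encoder (hypothetical, not shown)
    produced one fixed-size vector per utterance.
    """
    if not utterance_embeddings:
        raise ValueError("need at least one utterance embedding")
    count = len(utterance_embeddings)
    return [sum(dim) / count for dim in zip(*utterance_embeddings)]


def condition_frames(frames: Sequence[Sequence[float]],
                     speaker_vector: Sequence[float]) -> List[List[float]]:
    """Append the speaker vector to every spectrogram frame.

    Assumption: concatenation is one simple way to condition a generator;
    the paper's model may inject the embedding differently.
    """
    return [list(frame) + list(speaker_vector) for frame in frames]
```

With a one-hot code, a speaker outside the training set has no valid representation (`one_hot_code` raises an error), whereas `mean_embedding` works for any speaker with at least one encoded utterance. That difference is what makes a zero-shot setting possible in principle.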
Findings
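StarGAN-ZSVC gives small improvements over AutoVC, another neural zero-shot approach, in the zero-shot setting.
Real-time zero-shot voice conversion is achievable even for a model trained on very little data.
Whether scaling up StarGAN-ZSVC also improves zero-shot conversion quality in high-resource contexts remains to be tested.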
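One common way to quantify "real time or faster" is the real-time factor: processing time divided by the duration of the audio processed, with values of 1.0 or below meaning the system keeps up with playback. The helper below is a small sketch under that assumption; the function name is hypothetical, and the paper may report speed with a different metric.

```python
import time
from typing import Callable


def real_time_factor(convert: Callable[[], object], audio_seconds: float) -> float:
    """Return wall-clock processing time divided by audio duration.

    `convert` is a hypothetical zero-argument callable that runs one
    conversion; a result <= 1.0 means real time or faster.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    convert()
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds
```

In practice the measurement would be averaged over many utterances and taken on the target hardware, since a single timing is noisy.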
Abstract
Voice conversion is the task of converting a spoken utterance from a source speaker so that it appears to be said by a different target speaker while retaining the linguistic content of the utterance. Recent advances have led to major improvements in the quality of voice conversion systems. However, to be useful in a wider range of contexts, voice conversion systems would need to (i) be trainable without access to parallel data, (ii) work in a zero-shot setting where both the source and target speakers are unseen during training, and (iii) run in real time or faster. Recent techniques fulfil one or two of these requirements, but not all three. This paper extends recent voice conversion models based on generative adversarial networks (GANs) to satisfy all three of these conditions. We specifically extend the recent StarGAN-VC model by conditioning it on a speaker embedding (from a…
