NoiseVC: Towards High Quality Zero-Shot Voice Conversion
Shijun Wang, Damian Borth

TL;DR
NoiseVC introduces a novel approach combining Vector Quantization, Contrastive Predictive Coding, and noise augmentation to improve zero-shot voice conversion quality and disentanglement without heavily constrained bottlenecks.
Contribution
The paper proposes NoiseVC, a new zero-shot voice conversion method that enhances disentanglement and quality using VQ, CPC, and noise augmentation techniques.
Findings
Strong disentanglement ability demonstrated
Small sacrifice in sound quality observed
Effective zero-shot voice conversion achieved
Abstract
Voice conversion (VC) is the task of transferring the voice of a target audio onto a source utterance without losing linguistic content. It is especially challenging when the source and target speakers are unseen during training (zero-shot VC). Previous approaches require a pre-trained model or linguistic data to perform zero-shot conversion. Meanwhile, VC models with Vector Quantization (VQ) or Instance Normalization (IN) are able to disentangle content from audio and achieve successful conversions. However, disentanglement in these models relies heavily on tightly constrained bottleneck layers, so sound quality is drastically sacrificed. In this paper, we propose NoiseVC, an approach that can disentangle content based on VQ and Contrastive Predictive Coding (CPC). Additionally, Noise Augmentation is performed to further enhance disentanglement capability. We conduct several experiments and…
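The abstract names three ingredients (VQ, CPC, and noise augmentation) without giving implementation details here, so the following is only a minimal sketch of what each one typically looks like, not the authors' model. The function names add_noise, quantize, and info_nce are hypothetical; additive Gaussian noise on a feature frame and a dot-product similarity score are assumptions made for illustration. InfoNCE is used because it is the standard loss for CPC.

```python
import math
import random


def add_noise(frame, sigma, rng):
    """Return a copy of `frame` with zero-mean Gaussian noise (std `sigma`) added.

    Assumption: the noise type and where it is applied are illustrative only.
    """
    return [x + rng.gauss(0.0, sigma) for x in frame]


def quantize(vec, codebook):
    """Return (index, code) of the codebook entry nearest to `vec` (squared L2)."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(vec, code))

    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]


def info_nce(prediction, positive, negatives):
    """InfoNCE loss for one prediction: negative log-softmax of the positive score.

    Assumption: similarity is a plain dot product.
    """
    def score(a, b):
        return sum(x * y for x, y in zip(a, b))

    scores = [score(prediction, positive)]
    scores += [score(prediction, n) for n in negatives]
    top = max(scores)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_denominator - scores[0]


# Toy usage: perturb a 2-D "content" frame, snap it to a 3-entry codebook,
# then score the chosen code against one positive and two negatives.
rng = random.Random(0)
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
frame = [0.9, 0.1]
noisy = add_noise(frame, sigma=0.05, rng=rng)
index, code = quantize(noisy, codebook)
loss = info_nce(code, positive=[1.0, 0.0], negatives=[[0.0, 1.0], [-1.0, 0.0]])
```

In this toy setting the quantizer acts as the discrete bottleneck, and the loss decreases as the prediction scores its positive higher than the negatives. A real system would learn the codebook and encoder jointly and operate on sequences of acoustic features rather than hand-written vectors.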
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: InfoNCE · Instance Normalization · Contrastive Predictive Coding
