NoiseVC: Towards High Quality Zero-Shot Voice Conversion

Shijun Wang; Damian Borth

arXiv:2104.06074 · cs.SD · April 14, 2021 · 6 cites

Open Access

TL;DR

NoiseVC introduces a novel approach combining Vector Quantization, Contrastive Predictive Coding, and noise augmentation to improve zero-shot voice conversion quality and disentanglement without heavily constrained bottlenecks.

Contribution

The paper proposes NoiseVC, a new zero-shot voice conversion method that enhances disentanglement and quality using VQ, CPC, and noise augmentation techniques.

Findings

01 Strong disentanglement ability demonstrated

02 Small sacrifice in sound quality observed

03 Effective zero-shot voice conversion achieved

Abstract

Voice conversion (VC) is the task of transferring the voice of a target speaker onto source audio without losing the linguistic content. It is especially challenging when the source and target speakers are unseen during training (zero-shot VC). Previous approaches require a pre-trained model or linguistic data to perform zero-shot conversion. Meanwhile, VC models with Vector Quantization (VQ) or Instance Normalization (IN) are able to disentangle content from audio and achieve successful conversions. However, disentanglement in these models relies heavily on tightly constrained bottleneck layers, so sound quality is drastically sacrificed. In this paper, we propose NoiseVC, an approach that can disentangle content based on VQ and Contrastive Predictive Coding (CPC). Additionally, Noise Augmentation is performed to further enhance disentanglement capability. We conduct several experiments and…
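
The three ingredients named in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not specify the codebook size, the type of noise, or the CPC scoring function, so the Gaussian noise, the dot-product scores, and the helper names `vector_quantize`, `add_gaussian_noise`, and `info_nce` are all assumptions made for illustration only.

```python
import math
import random


def vector_quantize(frames, codebook):
    """Map each frame to the index of its nearest codebook vector (squared L2)."""
    indices = []
    for frame in frames:
        distances = [
            sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook
        ]
        indices.append(distances.index(min(distances)))
    return indices


def add_gaussian_noise(frames, sigma, rng):
    """Return a noisy copy of the frames (assumed augmentation, not from the paper)."""
    return [[x + rng.gauss(0.0, sigma) for x in frame] for frame in frames]


def info_nce(context, positive, negatives):
    """InfoNCE loss for one prediction, using dot-product scores (an assumption)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # The positive sample is placed first; the loss is -log softmax(scores)[0].
    scores = [dot(context, positive)] + [dot(context, n) for n in negatives]
    max_score = max(scores)  # subtract the max for numerical stability
    log_sum = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_sum - scores[0]


# Toy 2-dimensional "frames" and a 3-entry codebook (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]

codes = vector_quantize(frames, codebook)
noisy = add_gaussian_noise(frames, sigma=0.01, rng=random.Random(0))
noisy_codes = vector_quantize(noisy, codebook)
loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
```

In this toy setup the quantized codes stay the same under small input noise, which is the kind of robustness a discrete bottleneck is meant to provide; how NoiseVC actually combines the three components is described in the full paper, not here.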

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: InfoNCE · Instance Normalization · Contrastive Predictive Coding