Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning

Shijun Wang; Dimche Kostadinov; Damian Borth

arXiv:2110.14422 · cs.SD · June 1, 2022

TL;DR

This paper introduces a self-supervised method for learning disentangled representations of prosody (pitch and volume). Conditioning a voice conversion model on these representations improves zero-shot conversion and reduces prosody leakage, surpassing current state-of-the-art methods.

Contribution

It proposes a novel self-supervised approach for learning disentangled prosody representations to enhance zero-shot voice conversion performance.

Findings

01. Prosody representations are disentangled and rich in information.

02. Adding prosody representations improves VC performance.

03. Our method surpasses state-of-the-art zero-shot VC results.

Abstract

Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic as it enables a range of applications like voice customizing, animation production, and others. Recent work in this area made progress with disentanglement methods that separate utterance content and speaker characteristics from speech audio recordings. However, many of these methods are subject to the leakage of prosody (e.g., pitch, volume), causing the speaker voice in the synthesized speech to be different from the desired target speakers. To prevent this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations that can represent the prosody styles of different speakers. We then use the learned prosodic representations as conditional information to train and enhance our VC model for zero-shot conversion. In our…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing