Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning
Shijun Wang, Dimche Kostadinov, Damian Borth

TL;DR
This paper introduces a self-supervised method to learn disentangled prosody representations, such as pitch and volume, to improve zero-shot voice conversion accuracy and reduce prosody leakage, surpassing current state-of-the-art methods.
Contribution
The paper proposes a novel self-supervised approach for learning disentangled prosody representations (pitch and volume), which are then used as conditional information to enhance zero-shot voice conversion performance.
Findings
The learned prosody representations (pitch and volume) are disentangled and informative.
Adding the prosody representations as conditional information improves VC performance.
The proposed method surpasses state-of-the-art zero-shot VC results.
Abstract
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive research topic as it enables a range of applications like voice customizing, animation production, and others. Recent work in this area made progress with disentanglement methods that separate utterance content and speaker characteristics from speech audio recordings. However, many of these methods are subject to the leakage of prosody (e.g., pitch, volume), causing the speaker voice in the synthesized speech to be different from the desired target speakers. To prevent this issue, we propose a novel self-supervised approach that effectively learns disentangled pitch and volume representations that can represent the prosody styles of different speakers. We then use the learned prosodic representations as conditional information to train and enhance our VC model for zero-shot conversion. In our…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
