ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations
Shehzeen Hussain, Paarth Neekhara, Jocelyn Huang, Jason Li, Boris Ginsburg

TL;DR
This paper introduces ACE-VC, a zero-shot voice conversion method that builds on self-supervised speech representations and disentangles content from speaker features, enabling controllable, high-quality voice conversion from as little as 10 seconds of target-speaker audio.
Contribution
It presents a novel multi-task, self-supervised framework with a Siamese network training strategy for disentangling speech features, enabling zero-shot, any-to-any voice conversion.
Findings
Achieves state-of-the-art speaker similarity, intelligibility, and naturalness metrics.
Performs effective voice swapping with only 10 seconds of target data.
Attains a low speaker verification equal error rate (EER) for both seen and unseen speakers.
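For readers unfamiliar with the metric in the last finding, EER is the operating point at which a speaker verification system's false acceptance rate equals its false rejection rate. The following is a minimal sketch of how it can be estimated from similarity scores. The function name, the score lists, and the threshold sweep are illustrative assumptions; this summary does not describe the verification model or scoring procedure the authors used.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate EER from verification scores (hypothetical helper).

    target_scores: scores for same-speaker trials (higher = more similar).
    nontarget_scores: scores for different-speaker trials.
    """
    # Candidate thresholds: every distinct observed score.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, best_eer = None, None
    for t in thresholds:
        # False rejection: a same-speaker trial scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False acceptance: a different-speaker trial scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2.0
    return best_eer


# Made-up scores, for illustration only.
separable = equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3])
overlapping = equal_error_rate([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.5, 0.3])
```

With perfectly separated scores the estimate is 0; the overlapping example has one misranked trial on each side out of four, so the estimate is 0.25.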
Abstract
In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework allows controllable and speaker-adaptive synthesis to perform zero-shot any-to-any voice conversion achieving state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10…
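The Siamese training idea in the abstract can be illustrated with a small sketch: the same content encoder processes an utterance and its pitch-shifted copy, and a loss term rewards the two content representations for being similar. The version below assumes frame-aligned embeddings and uses one minus the mean cosine similarity as the penalty. The function names and the exact form of the loss are assumptions made for illustration; the abstract states only that similarity is encouraged, not how it is measured.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon guards against division by zero for all-zero frames.
    return dot / (norm_u * norm_v + 1e-8)


def siamese_content_loss(content_orig, content_shifted):
    """Hypothetical similarity loss between two frame sequences.

    content_orig: per-frame content embeddings of the original audio.
    content_shifted: per-frame content embeddings of the pitch-shifted
        audio, produced by the same (weight-shared) encoder.
    Returns 1 - mean cosine similarity; 0 means the embeddings agree.
    """
    if len(content_orig) != len(content_shifted):
        raise ValueError("sequences must have the same number of frames")
    sims = [cosine_similarity(u, v)
            for u, v in zip(content_orig, content_shifted)]
    return 1.0 - sum(sims) / len(sims)


# Toy embeddings standing in for encoder outputs (illustration only).
frames = [[1.0, 0.0], [0.0, 1.0]]
loss_same = siamese_content_loss(frames, frames)
loss_orthogonal = siamese_content_loss(frames, [[0.0, 1.0], [1.0, 0.0]])
```

Because pitch shifting changes how the speaker sounds but not what is said, driving this term toward zero pushes the content representation to ignore speaker-dependent cues, which is the disentanglement goal the abstract describes.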
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
