ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations

Shehzeen Hussain; Paarth Neekhara; Jocelyn Huang; Jason Li; Boris Ginsburg

arXiv:2302.08137 · cs.SD · February 17, 2023 · 1 citation

TL;DR

This paper introduces ACE-VC, a zero-shot voice conversion method that uses self-supervised speech representations and disentangles content and speaker features for controllable, high-quality voice conversion with minimal data.

Contribution

It presents a novel multi-task, self-supervised framework with a Siamese network training strategy for disentangling speech features, enabling zero-shot, any-to-any voice conversion.

Findings

01. Achieves state-of-the-art speaker similarity, intelligibility, and naturalness metrics.

02. Performs effective voice swapping with only 10 seconds of target data.

03. Attains low speaker verification EER for both seen and unseen speakers.
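
For readers unfamiliar with the metric in the third finding, the equal error rate (EER) is the operating point at which a speaker verification system's false acceptance rate equals its false rejection rate. The sketch below is a generic, threshold-sweep illustration and not the paper's evaluation code; the function name `equal_error_rate` and the toy scores are assumptions made for this example.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    genuine_scores: similarity scores for same-speaker trials.
    impostor_scores: similarity scores for different-speaker trials.
    A trial is accepted when its score is >= the threshold.
    Hypothetical helper for illustration only.
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("both score lists must be non-empty")

    best_gap = None
    eer = None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        # False acceptance rate: impostor trials wrongly accepted.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: genuine trials wrongly rejected.
        frr = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0  # midpoint where the two rates are closest
    return eer


if __name__ == "__main__":
    # Toy scores (made up): one genuine trial scores below one impostor trial.
    print(equal_error_rate([0.9, 0.4], [0.6, 0.1]))
```

In a voice conversion setting, the genuine trials would presumably pair converted speech with real speech from the target speaker, so a low EER suggests the converted audio is recognized as the target speaker.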

Abstract

In this work, we propose a zero-shot voice conversion method using speech representations trained with self-supervised learning. First, we develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style. To disentangle content and speaker representations, we propose a training strategy based on Siamese networks that encourages similarity between the content representations of the original and pitch-shifted audio. Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its decomposed representation. Our framework allows controllable and speaker-adaptive synthesis to perform zero-shot any-to-any voice conversion achieving state-of-the-art results on metrics evaluating speaker similarity, intelligibility, and naturalness. Using just 10…
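
The Siamese training idea in the abstract can be illustrated with a small sketch: the same content encoder processes the original and the pitch-shifted audio, and a loss penalizes differences between the two content representations. The abstract does not state the exact loss, so the frame-wise cosine distance used here is an assumption, as are the function names and the toy vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def siamese_content_loss(content_original, content_shifted):
    """Mean (1 - cosine similarity) over time-aligned content frames.

    content_original: list of per-frame content vectors from the original audio.
    content_shifted: list of per-frame content vectors from the pitch-shifted audio.
    Assumption: both sequences come from the same encoder and have equal length.
    """
    if len(content_original) != len(content_shifted):
        raise ValueError("sequences must have the same number of frames")
    if not content_original:
        raise ValueError("sequences must be non-empty")
    total = sum(
        1.0 - cosine_similarity(u, v)
        for u, v in zip(content_original, content_shifted)
    )
    return total / len(content_original)


if __name__ == "__main__":
    # Toy content vectors (made up). The second sequence is a scaled copy,
    # so the loss should be close to zero.
    original = [[1.0, 2.0], [0.5, 0.5]]
    shifted = [[2.0, 4.0], [1.0, 1.0]]
    print(siamese_content_loss(original, shifted))
```

Because pitch shifting changes how the speaker sounds while leaving the words intact, minimizing a loss of this kind would push the content representation to ignore speaker-dependent cues, which is the disentanglement the abstract describes.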

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing