Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features

Trung Dang; Dung Tran; Peter Chin; Kazuhito Koishida

arXiv:2112.04424·cs.SD·February 14, 2022

TL;DR

This paper introduces a robust zero-shot voice conversion method leveraging self-supervised speech features, a length resampling decoder, and additional training techniques to improve speaker similarity and content preservation without parallel data.

Contribution

It proposes a novel length resampling decoder and training strategies that enhance zero-shot voice conversion quality using self-supervised features without changing the core architecture.

Findings

01 Outperforms baselines on VCTK dataset

02 Achieves high speaker similarity and content preservation

03 Best performance on LibriTTS with proposed techniques

Abstract

Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristics of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representations has been shown to produce useful linguistic units without using transcripts, and these units can be passed directly to a VC model. In this paper, we show that high-quality audio samples can be achieved by using a length resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We show that our method can outperform many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrate that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c)…
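
To make the sequence-length mismatch in the abstract concrete, here is a minimal sketch of resampling a feature sequence from one length to another. It is not the paper's decoder, which is a learned component whose details are not given on this page. This sketch swaps in plain linear interpolation, and the function name `resample_length` and the frame counts in the usage lines are hypothetical, chosen only for illustration.

```python
from typing import List


def resample_length(frames: List[List[float]], target_len: int) -> List[List[float]]:
    """Linearly interpolate a [T, D] feature sequence to [target_len, D].

    Assumption: fixed linear interpolation stands in for the learned
    length resampling decoder described in the abstract.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if target_len <= 0:
        raise ValueError("target_len must be positive")

    src_len = len(frames)
    # Degenerate cases: nothing to interpolate between, so repeat the first frame.
    if src_len == 1 or target_len == 1:
        return [list(frames[0]) for _ in range(target_len)]

    # Map output index i to a fractional position on the source time axis,
    # so that the first and last frames line up exactly.
    scale = (src_len - 1) / (target_len - 1)
    out = []
    for i in range(target_len):
        pos = i * scale
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo  # weight of the right-hand neighbour
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Hypothetical usage: 4 linguistic-feature frames stretched to 7 vocoder frames.
content = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
mel_like = resample_length(content, 7)
print(len(mel_like), len(mel_like[0]))
```

The point of the sketch is only that the feature extractor and the vocoder can run at different frame rates as long as something in between maps one time axis onto the other. In the paper that mapping is part of the decoder rather than a fixed interpolation rule.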

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing