Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features
Trung Dang, Dung Tran, Peter Chin, Kazuhito Koishida

TL;DR
This paper introduces a robust zero-shot voice conversion method leveraging self-supervised speech features, a length resampling decoder, and additional training techniques to improve speaker similarity and content preservation without parallel data.
Contribution
It proposes a length resampling decoder, together with training strategies that require no change to the architecture, to improve zero-shot voice conversion quality using self-supervised features.
Findings
Outperforms many baselines on the VCTK dataset
Achieves high speaker similarity and content preservation
Best performance on LibriTTS with proposed techniques
Abstract
Unsupervised Zero-Shot Voice Conversion (VC) aims to modify the speaker characteristic of an utterance to match an unseen target speaker without relying on parallel training data. Recently, self-supervised learning of speech representation has been shown to produce useful linguistic units without using transcripts, which can be directly passed to a VC model. In this paper, we showed that high-quality audio samples can be achieved by using a length resampling decoder, which enables the VC model to work in conjunction with different linguistic feature extractors and vocoders without requiring them to operate on the same sequence length. We showed that our method can outperform many baselines on the VCTK dataset. Without modifying the architecture, we further demonstrated that a) using pairs of different audio segments from the same speaker, b) adding a cycle consistency loss, and c)…
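To illustrate the idea of length resampling mentioned in the abstract, here is a minimal sketch that stretches or shrinks a sequence of feature frames to a target number of frames by linear interpolation. This is not the paper's decoder, whose internals are not described here; the function name `resample_length`, the use of linear interpolation, and the frame counts in the usage example are all assumptions chosen for illustration.

```python
from typing import List


def resample_length(frames: List[List[float]], target_len: int) -> List[List[float]]:
    """Linearly interpolate a (T_in, D) feature sequence to (target_len, D).

    Hypothetical helper: a stand-in for whatever resampling a real decoder
    performs between the feature extractor's and the vocoder's frame rates.
    """
    if not frames:
        raise ValueError("frames must be non-empty")
    if target_len <= 0:
        raise ValueError("target_len must be positive")

    src_len = len(frames)
    if src_len == 1:
        # Only one source frame: repeat it for every output position.
        return [list(frames[0]) for _ in range(target_len)]

    out = []
    for t in range(target_len):
        # Map output index t to a fractional position in the source sequence,
        # aligning the first and last frames of both sequences.
        pos = 0.0 if target_len == 1 else t * (src_len - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, src_len - 1)
        w = pos - lo
        out.append([(1.0 - w) * a + w * b for a, b in zip(frames[lo], frames[hi])])
    return out


# Usage: 4 one-dimensional linguistic-feature frames resampled to 7 frames,
# as if the vocoder expected a higher frame rate than the feature extractor.
features = [[0.0], [1.0], [2.0], [3.0]]
resampled = resample_length(features, 7)
```

Because the output length is a free parameter, the same routine can bridge any pair of frame rates, which is the property the abstract attributes to the length resampling decoder: the feature extractor and the vocoder no longer need to share a sequence length.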
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
