StyleTTS-VC: One-Shot Voice Conversion by Knowledge Transfer from Style-Based TTS Models

Yinghao Aaron Li; Cong Han; Nima Mesgarani

arXiv:2212.14227 · eess.AS · January 2, 2023 · 1 citation

Open Access · 1 Repo

TL;DR

StyleTTS-VC introduces a novel transfer learning approach from style-based TTS models to achieve high-fidelity, one-shot voice conversion without text input, significantly outperforming previous methods in naturalness and similarity.

Contribution

The paper presents a new method for disentangled speech representation using knowledge transfer from style-based TTS models, enabling effective one-shot voice conversion without text.

Findings

01 Outperforms previous state-of-the-art methods in naturalness and similarity

02 Uses cycle-consistent and adversarial training for high fidelity

03 Employs a novel data augmentation scheme for disentanglement

Abstract

One-shot voice conversion (VC) aims to convert speech from any source speaker to an arbitrary target speaker with only a few seconds of reference speech from the target speaker. This relies heavily on disentangling the speaker's identity and speech content, a task that remains challenging. Here, we propose a novel approach to learning disentangled speech representation by transfer learning from style-based text-to-speech (TTS) models. With cycle-consistent and adversarial training, the style-based TTS models can perform transcription-guided one-shot VC with high fidelity and similarity. By learning an additional mel-spectrogram encoder through teacher-student knowledge transfer and a novel data augmentation scheme, our approach results in disentangled speech representation without needing the input text. The subjective evaluation shows that our approach can significantly…
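The teacher-student knowledge transfer mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear student, the squared-error loss, the dimensions, and every function name below are assumptions made for illustration. The only idea taken from the abstract is that a student encoder operating on mel-spectrogram frames is trained to reproduce representations supplied by a frozen teacher, so that text is no longer needed at conversion time.

```python
import random

# Hypothetical sizes for this toy example, not values from the paper.
MEL_DIM = 3  # size of one toy "mel frame"
REP_DIM = 2  # size of the content representation


def make_toy_pairs(n, seed=0):
    """Build (mel_frame, teacher_output) training pairs.

    The hidden linear map is a stand-in for the frozen teacher. In a real
    system the teacher output would come from the pretrained TTS model.
    """
    rng = random.Random(seed)
    hidden = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
    frames = [[rng.uniform(-1.0, 1.0) for _ in range(MEL_DIM)] for _ in range(n)]
    targets = [[sum(w * x for w, x in zip(row, f)) for row in hidden] for f in frames]
    return frames, targets


def student_encode(weights, frame):
    """Toy student: a single linear layer applied to one mel frame."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def mean_squared_error(weights, frames, targets):
    """Average squared gap between student and teacher outputs."""
    total = 0.0
    for frame, target in zip(frames, targets):
        pred = student_encode(weights, frame)
        total += sum((p - t) ** 2 for p, t in zip(pred, target))
    return total / (len(frames) * REP_DIM)


def distill(frames, targets, steps=5000, lr=0.05, seed=0):
    """Fit the student to the teacher outputs with stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [[0.0] * MEL_DIM for _ in range(REP_DIM)]
    for _ in range(steps):
        k = rng.randrange(len(frames))
        frame, target = frames[k], targets[k]
        pred = student_encode(weights, frame)
        for i in range(REP_DIM):
            err = pred[i] - target[i]  # gradient of 0.5 * err**2 w.r.t. pred[i]
            for j in range(MEL_DIM):
                weights[i][j] -= lr * err * frame[j]
    return weights


if __name__ == "__main__":
    frames, targets = make_toy_pairs(20)
    student = distill(frames, targets)
    print(mean_squared_error(student, frames, targets))
```

Only the student's weights are updated here, while the teacher outputs are treated as fixed targets. The data augmentation scheme and the adversarial and cycle-consistent objectives named in the abstract are outside the scope of this sketch.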

Code & Models

Repositories

yl4579/StyleTTS-VC
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders