HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts

Xinlei Niu; Jing Zhang; Charles Patrick Martin

arXiv:2404.15637 · cs.SD · September 26, 2024

Open Access

TL;DR

HybridVC is a novel voice conversion framework that leverages a pre-trained CVAE with contrastive learning, supporting text and audio prompts for flexible, efficient, and multi-modal voice style conversion.

Contribution

The paper introduces HybridVC, a voice conversion (VC) model that combines latent modeling with contrastive learning, accepts both text and audio prompts, and can be trained efficiently under limited computational resources.

Findings

01 Experiments demonstrate superior training efficiency.

02 The model achieves multi-modal voice style conversion from text and audio prompts.

03 The design is validated through comprehensive ablation studies.

Abstract

We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pretrained speaker encoder and optimises style text embeddings to align with the speaker style information through contrastive learning in parallel. Therefore, HybridVC can be efficiently trained under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion. This underscores its potential for widespread applications such as user-defined personalised voice in various social media platforms. A comprehensive…
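The abstract describes aligning style text embeddings with speaker style information through contrastive learning. A minimal sketch of one common way to do this is a symmetric InfoNCE-style loss over paired text and speaker embeddings. This is an illustration under stated assumptions, not the authors' implementation: the exact loss form, the temperature value, and the function names (`symmetric_contrastive_loss`, `cosine_logits`) are hypothetical and do not come from the paper. Embeddings are assumed to be non-zero vectors, with row i of the text batch paired with row i of the speaker batch.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_logits(text_embs, speaker_embs, temperature=0.07):
    """Pairwise cosine similarities divided by a temperature.

    The temperature of 0.07 is an assumed illustrative value.
    """
    t = [l2_normalize(v) for v in text_embs]
    s = [l2_normalize(v) for v in speaker_embs]
    return [
        [sum(a * b for a, b in zip(ti, sj)) / temperature for sj in s]
        for ti in t
    ]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_embs, speaker_embs, temperature=0.07):
    """Average of the text-to-speaker and speaker-to-text losses."""
    logits = cosine_logits(text_embs, speaker_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (
        cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed)
    )
```

With this loss, a batch whose text embeddings point in the same direction as their paired speaker embeddings scores close to zero, while a batch with the pairs swapped scores high, which is the pressure that pulls matching text and speaker representations together.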

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques