HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts
Xinlei Niu, Jing Zhang, Charles Patrick Martin

TL;DR
HybridVC is a novel voice conversion framework that leverages a pre-trained CVAE with contrastive learning, supporting text and audio prompts for flexible, efficient, and multi-modal voice style conversion.
Contribution
The paper introduces HybridVC, a VC model that combines latent modelling with contrastive learning, accepts both text and audio prompts, and can be trained efficiently with limited computational resources.
Findings
Experiments demonstrate HybridVC's superior training efficiency.
HybridVC achieves multi-modal voice style conversion from text and audio prompts.
Comprehensive ablation studies support these results.
Abstract
We introduce HybridVC, a voice conversion (VC) framework built upon a pre-trained conditional variational autoencoder (CVAE) that combines the strengths of a latent model with contrastive learning. HybridVC supports text and audio prompts, enabling more flexible voice style conversion. HybridVC models a latent distribution conditioned on speaker embeddings acquired by a pretrained speaker encoder and optimises style text embeddings to align with the speaker style information through contrastive learning in parallel. Therefore, HybridVC can be efficiently trained under limited computational resources. Our experiments demonstrate HybridVC's superior training efficiency and its capability for advanced multi-modal voice style conversion. This underscores its potential for widespread applications such as user-defined personalised voice in various social media platforms. A comprehensive…
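The abstract says the style text embeddings are optimised to align with speaker style information through contrastive learning, but it does not give the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE (CLIP-style) objective over a batch of paired text and speaker embeddings. The function names, the temperature value, and the choice of InfoNCE itself are assumptions made for this example, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(text_emb, speaker_emb, temperature=0.07):
    """Hypothetical symmetric InfoNCE loss (an assumption, not the paper's loss).

    text_emb[i] and speaker_emb[i] are treated as a matching pair; every
    other combination in the batch is treated as a negative.
    """
    text = [l2_normalize(v) for v in text_emb]
    speaker = [l2_normalize(v) for v in speaker_emb]
    # Cosine similarity between every text/speaker pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, s)) / temperature for s in speaker]
        for t in text
    ]
    transposed = [list(col) for col in zip(*logits)]
    # Average the text-to-speaker and speaker-to-text directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy batch: two style descriptions and two speakers in a 2-D embedding space.
text_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_speakers = [[2.0, 0.0], [0.0, 3.0]]
swapped_speakers = [[0.0, 3.0], [2.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(text_batch, aligned_speakers)
loss_swapped = symmetric_contrastive_loss(text_batch, swapped_speakers)
```

In this toy batch the loss is lower when each text embedding points in the same direction as its paired speaker embedding than when the pairs are swapped, which is the behaviour a contrastive alignment objective is meant to reward.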
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Advanced Data Compression Techniques
