PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Jixun Yao, Yuguang Yang, Yi Lei, Ziqian Ning, Yanni Hu, Yu Pan, Jingjing Yin, Hongbin Zhou, Heng Lu, Lei Xie

TL;DR
PromptVC introduces a novel style voice conversion method that uses natural language prompts and a latent diffusion model to generate expressive, interpretable style vectors, overcoming limitations of traditional reference-based approaches.
Contribution
It presents a new approach combining latent diffusion models with natural language prompts for flexible, interpretable style voice conversion, enhancing style diversity and expressiveness.
Findings
Effective style conversion demonstrated through subjective and objective evaluations.
Enhanced style diversity and interpretability achieved with natural language prompts.
Improved style expressiveness via discrete token embedding and duration prediction.
Abstract
Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, current style voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which limits style diversity or falls short in the intuitiveness and interpretability of the style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and then the latent diffusion model is trained independently to sample the style vector from noise, with this process being conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete…
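The abstract's central step, sampling a style vector from noise under a natural language prompt, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a standard DDPM-style ancestral sampling loop with a linear noise schedule, which the abstract does not confirm. Every name here (make_schedule, toy_prompt_embedding, toy_noise_predictor, sample_style_vector) is hypothetical. The prompt embedding is a seeded random vector standing in for a real text encoder, and the noise predictor is a closed-form stand-in for the trained network that simply treats the prompt embedding as the clean style vector.

```python
import math
import random


def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative alpha products (assumed, not from the paper)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def toy_prompt_embedding(prompt, dim=8):
    """Hypothetical stand-in for a text encoder: a deterministic random vector per prompt."""
    rng = random.Random(sum(ord(ch) for ch in prompt))
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def toy_noise_predictor(z_t, t, prompt_emb, alpha_bars):
    """Hypothetical stand-in for the trained denoiser.

    It pretends the clean style vector equals the prompt embedding and
    returns the noise that would explain z_t under that assumption.
    """
    a_bar = alpha_bars[t]
    return [(z - math.sqrt(a_bar) * c) / math.sqrt(1.0 - a_bar)
            for z, c in zip(z_t, prompt_emb)]


def sample_style_vector(prompt_emb, num_steps=50, seed=0):
    """Reverse diffusion: start from Gaussian noise, denoise step by step."""
    rng = random.Random(seed)
    betas, alpha_bars = make_schedule(num_steps)
    z = [rng.gauss(0.0, 1.0) for _ in prompt_emb]  # pure noise at the last step
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_noise_predictor(z, t, prompt_emb, alpha_bars)
        scale = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(zi - scale * ei) / math.sqrt(1.0 - betas[t])
                for zi, ei in zip(z, eps_hat)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the final step
        z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return z


prompt_emb = toy_prompt_embedding("a slow, sorrowful whisper")
style_vector = sample_style_vector(prompt_emb)
print(style_vector)
```

Because the toy predictor is exact by construction, the sampled vector lands on the prompt embedding. In a real system the predictor would be a learned network, and the sampled vector would condition the voice conversion model in place of a style vector extracted from reference speech.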
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders
Methods: Diffusion · Latent Diffusion Model
