PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts

Jixun Yao; Yuguang Yang; Yi Lei; Ziqian Ning; Yanni Hu; Yu Pan; Jingjing Yin; Hongbin Zhou; Heng Lu; Lei Xie

arXiv:2309.09262 · eess.AS · December 27, 2023 · 1 citation

Open Access

TL;DR

PromptVC introduces a novel style voice conversion method that uses natural language prompts and a latent diffusion model to generate expressive, interpretable style vectors, overcoming limitations of traditional reference-based approaches.

Contribution

It presents a new approach combining latent diffusion models with natural language prompts for flexible, interpretable style voice conversion, enhancing style diversity and expressiveness.

Findings

01. Effective style conversion demonstrated through subjective and objective evaluations.

02. Enhanced style diversity and interpretability achieved with natural language prompts.

03. Improved style expressiveness via discrete token embedding and duration prediction.

Abstract

Style voice conversion aims to transform the style of source speech to a desired style according to real-world application demands. However, current style voice conversion approaches rely on pre-defined labels or reference speech to control the conversion process, which limits style diversity or falls short in the intuitiveness and interpretability of the style representation. In this study, we propose PromptVC, a novel style voice conversion approach that employs a latent diffusion model to generate a style vector driven by natural language prompts. Specifically, the style vector is extracted by a style encoder during training, and the latent diffusion model is then trained independently to sample the style vector from noise, conditioned on natural language prompts. To improve style expressiveness, we leverage HuBERT to extract discrete…
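The abstract's idea of training a diffusion model to sample a style vector from noise can be illustrated with a generic sketch. This is not the authors' implementation: it assumes a standard DDPM-style linear noise schedule and a noise-prediction objective, and every name and size here (`make_alpha_bars`, `add_noise`, `toy_denoiser`, the 256-dimensional style vector, the 768-dimensional prompt embedding) is hypothetical. The denoiser is a placeholder that ignores its inputs; a real one would be a trained network conditioned on the prompt embedding.

```python
import numpy as np


def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(style_vec, t, alpha_bars, rng):
    """Forward diffusion step: blend the clean style vector with Gaussian noise."""
    noise = rng.standard_normal(style_vec.shape)
    a = alpha_bars[t]
    noisy = np.sqrt(a) * style_vec + np.sqrt(1.0 - a) * noise
    return noisy, noise


def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((predicted_noise - true_noise) ** 2))


def toy_denoiser(noisy_style, t, prompt_embedding):
    """Hypothetical stand-in for a prompt-conditioned denoiser.

    It returns zeros so the example runs without a trained model.
    """
    return np.zeros_like(noisy_style)


rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()

# Assumed shapes: a style vector from a style encoder and a text-prompt embedding.
style_vec = rng.standard_normal(256)
prompt_embedding = rng.standard_normal(768)

t = 500
noisy, noise = add_noise(style_vec, t, alpha_bars, rng)
loss = noise_prediction_loss(toy_denoiser(noisy, t, prompt_embedding), noise)
```

In a full system, minimizing this loss over many style vectors, timesteps, and paired prompts would train the denoiser; at inference, the learned denoiser would be applied iteratively starting from pure noise to produce a style vector for a given prompt.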

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders

Methods: Diffusion · Latent Diffusion Model