Converting Anyone's Voice: End-to-End Expressive Voice Conversion with a Conditional Diffusion Model

Zongyang Du; Junchen Lu; Kun Zhou; Lakshmish Kaushik; Berrak Sisman

arXiv:2405.01730·eess.AS·May 6, 2024

TL;DR

This paper introduces an end-to-end expressive voice conversion framework using a conditional diffusion model, effectively modeling emotional style and speaker identity without relying on vocoders, leading to improved speech quality.

Contribution

It presents a novel fully end-to-end voice conversion method based on a conditional diffusion model that jointly models speaker identity and emotional style using speech units and deep features.

Findings

01 Effective emotion and speaker identity modeling demonstrated through evaluations

02 Outperforms vocoder-based approaches in speech quality

03 Framework is publicly available for further research

Abstract

Expressive voice conversion (VC) conducts speaker identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Emotional style modeling for arbitrary speakers in expressive VC has not been extensively explored. Previous approaches have relied on vocoders for speech reconstruction, which makes speech quality heavily dependent on the performance of vocoders. A major challenge of expressive VC lies in emotion prosody modeling. To address these challenges, this paper proposes a fully end-to-end expressive VC framework based on a conditional denoising diffusion probabilistic model (DDPM). We utilize speech units derived from self-supervised speech models as content conditioning, along with deep features extracted from speech emotion recognition and speaker verification systems to model emotional style and speaker identity. Objective and subjective…
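The abstract names a conditional DDPM but this page gives no equations, so the following is only a minimal sketch of the generic DDPM training ingredients: a noise schedule, the closed-form forward noising step, a conditioning vector, and the noise-prediction loss. Everything specific here is an assumption rather than the paper's design: the linear schedule and its endpoints, the 1000 steps, the toy waveform, the embedding sizes, and the hypothetical helper `build_condition`, which simply concatenates a per-frame speech-unit id with utterance-level emotion and speaker embeddings.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (an assumed, commonly used default)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0); also return the Gaussian noise that was added."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


def build_condition(content_units, emotion_emb, speaker_emb):
    """Hypothetical conditioning: one vector per frame, made of the unit id
    followed by the utterance-level emotion and speaker embeddings."""
    return [[float(u)] + list(emotion_emb) + list(speaker_emb) for u in content_units]


def noise_prediction_loss(pred_noise, true_noise):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(pred_noise, true_noise)) / len(true_noise)


rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

# Toy stand-in for a target-speaker waveform segment.
waveform = [math.sin(0.1 * n) for n in range(16)]
noisy, noise = q_sample(waveform, t=500, alpha_bars=alpha_bars, rng=rng)

# Toy speech units and embeddings; real ones would come from pretrained models.
cond = build_condition([3, 3, 7, 7], emotion_emb=[0.2, -0.1], speaker_emb=[0.5, 0.4, -0.3])
```

In a real system a neural denoiser would take `noisy`, the step index, and `cond`, and be trained to minimize `noise_prediction_loss` against `noise`; that network is omitted here because the page does not describe it.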

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Diffusion