Enhancing Expressive Voice Conversion with Discrete Pitch-Conditioned Flow Matching Model

Jialong Zuo; Shengpeng Ji; Minghui Fang; Ziyue Jiang; Xize Cheng; Qian Yang; Wenrui Liu; Guangyan Zhang; Zehai Tu; Yiwen Guo; Zhou Zhao

arXiv:2502.05471·cs.SD·February 11, 2025

TL;DR

This paper presents PFlow-VC, a novel voice conversion model that uses discrete pitch tokens and target speaker prompts to enhance expressiveness, style transfer, and timbre similarity in speech synthesis.

Contribution

The paper introduces a simple, efficient approach that combines self-supervised pitch discretization (a pretrained pitch VQVAE) with flow matching to improve expressive voice conversion.
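The two ingredients named above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `quantize_pitch` and `cfm_training_pair`, the toy codebook, and the choice of the commonly used optimal-transport probability path are all assumptions made for illustration, since this summary does not specify the codebook size or the exact flow matching formulation used in PFlow-VC.

```python
import numpy as np

def quantize_pitch(latents, codebook):
    """Vector quantization: map each frame latent to its nearest codebook index.

    latents:  (T, D) frame-level pitch latents (hypothetical encoder output)
    codebook: (K, D) codebook entries (hypothetical; learned in a real VQVAE)
    Returns an integer array of shape (T,), i.e. the discrete pitch tokens.
    """
    # Squared Euclidean distance between every frame and every code: (T, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def cfm_training_pair(x1, rng, sigma_min=1e-4):
    """Build one conditional flow matching training example.

    Assumes the optimal-transport path (an assumption, not confirmed by the source):
        x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
        u_t = x1 - (1 - sigma_min) * x0
    where x0 is Gaussian noise and x1 is the target (e.g. a Mel-spectrogram).
    A model v(x_t, t, conditions) would be regressed onto u_t with an MSE loss.
    """
    x0 = rng.standard_normal(x1.shape)  # noise sample
    t = rng.uniform()                   # time drawn uniformly from [0, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    u_t = x1 - (1.0 - sigma_min) * x0
    return t, x_t, u_t

# Toy usage: three frames of 2-D pitch latents and a 3-entry codebook.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [3.5, 4.2]])
tokens = quantize_pitch(latents, codebook)

rng = np.random.default_rng(0)
mel = rng.standard_normal((5, 4))       # stand-in for a Mel-spectrogram
t, x_t, u_t = cfm_training_pair(mel, rng)
```

In a pitch-conditioned setup, the discrete tokens would be embedded and passed to the velocity model as conditioning alongside the speaker prompt; the sketch leaves the model itself out.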

Findings

1. Outperforms previous models in timbre and style transfer on LibriTTS and ESD datasets.

2. Effectively models in-context pitch for more natural voice conversion.

3. Enhances timbre similarity by integrating global and time-varying timbre embeddings.

Abstract

This paper introduces PFlow-VC, a conditional flow matching voice conversion model that leverages fine-grained discrete pitch tokens and target speaker prompt information for expressive voice conversion (VC). Previous VC work focuses primarily on speaker conversion; enhancing expressiveness (such as prosody and emotion) during timbre conversion needs further exploration. Unlike previous methods, we adopt a simple and efficient approach to enhance the style expressiveness of voice conversion models. Specifically, we pretrain a self-supervised pitch VQVAE model to discretize speaker-irrelevant pitch information and leverage a masked pitch-conditioned flow matching model for Mel-spectrogram synthesis. This provides in-context pitch modeling capabilities for the speaker conversion model, effectively improving its voice style transfer capacity. Additionally, we improve timbre…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems