ReFlow-VC: Zero-shot Voice Conversion Based on Rectified Flow and Speaker Feature Optimization

Pengyu Ren; Wenhao Guan; Kaidi Wang; Peijie Chen; Qingyang Hong; Lin Li

arXiv:2506.01032·cs.SD·June 3, 2025

Open Access

TL;DR

ReFlow-VC is a rectified flow-based speech conversion method that achieves high-fidelity zero-shot conversion with fewer sampling steps than diffusion models, and that optimizes speaker features using content and pitch information.

Contribution

The paper introduces ReFlow-VC, a rectified flow-based model for speech conversion that reduces sampling steps and enhances zero-shot conversion accuracy.

Findings

01. Performs well on small datasets

02. Achieves high fidelity in zero-shot scenarios

03. Reduces sampling steps compared to diffusion models

Abstract

In recent years, diffusion-based generative models, including Denoising Diffusion Probabilistic Models (DDPM), have demonstrated remarkable performance in speech conversion. However, this performance comes at the cost of a large number of sampling steps, which hinders practical application in real-world scenarios. In this paper, we introduce ReFlow-VC, a novel high-fidelity speech conversion method based on rectified flow. Specifically, ReFlow-VC is an Ordinary Differential Equation (ODE) model that transforms a Gaussian distribution to the true Mel-spectrogram distribution along the most direct path. Furthermore, we propose a modeling approach that optimizes speaker features by utilizing both content and pitch information, allowing speaker features to reflect the properties of the current speech more accurately. Experimental results show…
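The rectified flow idea in the abstract (an ODE that carries Gaussian noise to a Mel-spectrogram along a straight path) can be illustrated with a minimal toy sketch. This is not the authors' implementation: the function names, the 80-dimensional stand-in frame, and the `oracle_velocity` field are all assumptions made for illustration. In an actual system the velocity would presumably be predicted by a neural network conditioned on content, pitch, and speaker features.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def velocity_target(x0, x1):
    """Regression target for the velocity field: the constant direction x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)
# Hypothetical stand-in for one 80-bin Mel-spectrogram frame (not real audio).
mel_frame = [random.gauss(0.0, 1.0) for _ in range(80)]
# Sample from the Gaussian prior.
noise = [random.gauss(0.0, 1.0) for _ in range(80)]


def oracle_velocity(x, t):
    """Assumed perfect velocity field for this single (noise, data) pair.

    A trained model would only approximate this direction.
    """
    return velocity_target(noise, mel_frame)


one_step = euler_sample(oracle_velocity, noise, num_steps=1)
ten_steps = euler_sample(oracle_velocity, noise, num_steps=10)
```

Because the ideal path is a straight line with constant velocity, a single Euler step already lands on the target in this toy setting. A learned velocity field is only approximately straight, so a real sampler would still need a few steps, but this is the intuition behind using fewer steps than a diffusion model.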

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing