Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching

Jialong Zuo; Shengpeng Ji; Minghui Fang; Mingze Li; Ziyue Jiang; Xize Cheng; Xiaoda Yang; Chen Feiyang; Xinyu Duan; Zhou Zhao

arXiv:2506.01014·eess.AS·June 3, 2025

TL;DR

This paper introduces R-VC, a zero-shot voice conversion model that transfers both the rhythm and the timbre of the target speaker. It combines shortcut flow matching with content token discretization, and it remains effective with smaller training datasets and fewer sampling steps.

Contribution

R-VC is the first model to combine rhythm controllability with efficient zero-shot voice conversion using shortcut flow matching and content token discretization.

Findings

01. Achieves speaker similarity comparable to state-of-the-art methods.

02. Surpasses existing methods in speech naturalness and intelligibility.

03. Operates effectively with smaller datasets and fewer sampling steps.

Abstract

Zero-shot voice conversion (VC) aims to transform the source speaker's timbre into that of an arbitrary unseen speaker while retaining the speech content. Most prior work focuses on preserving the source's prosody, yet fine-grained timbre information may leak through prosody, and transferring the target's prosody to the synthesized speech is rarely studied. In light of this, we propose R-VC, a rhythm-controllable and efficient zero-shot voice conversion model. R-VC employs data perturbation techniques and discretizes source speech into HuBERT content tokens, eliminating much content-irrelevant information. By leveraging a Mask Generative Transformer for in-context duration modeling, our model adapts the duration of the linguistic content to the desired target speaking style, facilitating the transfer of the target speaker's rhythm. Furthermore, R-VC introduces a powerful Diffusion Transformer (DiT) with shortcut flow…
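
As a rough illustration of why a shortcut-style flow model can generate in very few steps: in shortcut models generally, the network is conditioned on the step size d as well as the time t, and the sampler jumps from x_t to x_t + d * s(x_t, t, d). The sketch below is not R-VC's implementation. The names `shortcut_sample`, `toy_shortcut`, and `TARGET` are hypothetical, and a closed-form toy function stands in for the trained DiT so the loop can be run as-is.

```python
from typing import Callable, List

Vector = List[float]
ShortcutFn = Callable[[Vector, float, float], Vector]


def shortcut_sample(shortcut_fn: ShortcutFn, x0: Vector, num_steps: int) -> Vector:
    """Integrate from t=0 (noise) to t=1 (data) in num_steps equal jumps."""
    d = 1.0 / num_steps  # step size, also passed to the model as conditioning
    x = list(x0)
    for i in range(num_steps):
        t = i * d
        s = shortcut_fn(x, t, d)  # direction for a jump of size d from time t
        x = [xi + d * si for xi, si in zip(x, s)]
    return x


# Assumption: a single target point, so every path is a straight line and the
# exact shortcut direction is known in closed form. A real model would be a
# neural network trained to predict this quantity.
TARGET: Vector = [0.5, -1.0, 2.0]


def toy_shortcut(x: Vector, t: float, d: float) -> Vector:
    # d is unused here because straight paths give the same direction for any
    # step size; a trained shortcut model would depend on it.
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(TARGET, x)]


if __name__ == "__main__":
    noise = [0.0, 0.0, 0.0]
    for steps in (1, 2, 4):
        print(steps, shortcut_sample(toy_shortcut, noise, steps))
```

With this toy function every step count lands on `TARGET` up to floating-point error, which is the idealized behaviour a step-size-conditioned model is trained to approximate. A real network would trade some accuracy for fewer steps.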

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Linear Layer · Adam · Dense Connections · Softmax · Diffusion · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Multi-Head Attention · Byte Pair Encoding