REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

Yuepeng Jiang; Ziqian Ning; Shuai Wang; Chengjia Wang; Mengxiao Bi; Pengcheng Zhu; Zhonghua Fu; Lei Xie

arXiv:2508.04996·eess.AS·August 11, 2025

TL;DR

REF-VC is a zero-shot voice conversion system that targets noise robustness, expressiveness, and fast inference by combining random erasing of SSL features, implicit alignment, and Shortcut Models. It outperforms baselines such as Seed-VC in noisy zero-shot scenarios.

Contribution

The paper introduces REF-VC, a voice conversion framework that improves noise robustness and expressiveness through random erasing and implicit alignment, and uses Shortcut Models to cut flow matching inference to 4 steps.

Findings

01 Outperforms baselines such as Seed-VC in noisy zero-shot scenarios

02 Maintains quality on clean data comparable to Seed-VC

03 Enables singing voice conversion within one unified model

Abstract

In real-world voice conversion applications, environmental noise in the source speech and user demand for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosodic richness, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust, expressive voice conversion system. Key innovations include: (1) a random erasing strategy that mitigates the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) implicit alignment, inspired by E2TTS, to suppress non-essential feature reconstruction; (3) integration of Shortcut Models to accelerate flow matching inference, reducing the number of sampling steps to 4. Experimental results demonstrate that REF-VC outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy…
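To make the 4-step claim in point (3) concrete, here is a minimal sketch of few-step sampling in the style of shortcut models, where the network is conditioned on the step size d as well as the time t, and each update is x <- x + d * s(x, t, d). This is an illustration under stated assumptions, not the REF-VC implementation: the function names, the `model(x, t, d)` signature, and the toy model are all hypothetical, and a real system would operate on mel-spectrogram tensors with a trained Diffusion Transformer.

```python
from typing import Callable, List

Vector = List[float]
# Hypothetical interface: the model maps (x_t, t, d) to a direction,
# where d is the step size it is asked to jump.
ShortcutModel = Callable[[Vector, float, float], Vector]


def sample_with_shortcut(model: ShortcutModel, x0: Vector, num_steps: int = 4) -> Vector:
    """Integrate from t=0 (noise) to t=1 (data) in num_steps equal jumps."""
    d = 1.0 / num_steps          # step size, also passed to the model
    x = list(x0)
    for i in range(num_steps):
        t = i * d                # current time in [0, 1)
        direction = model(x, t, d)
        # Euler-style update using the step-size-conditioned direction.
        x = [xi + d * si for xi, si in zip(x, direction)]
    return x


# Toy stand-in for a trained network (assumption for illustration only):
# it points along the straight line from the current x to a fixed target,
# so the sampler should land on the target after the final step.
TARGET: Vector = [2.0, -1.0]


def toy_model(x: Vector, t: float, d: float) -> Vector:
    return [(gi - xi) / (1.0 - t) for gi, xi in zip(TARGET, x)]


result = sample_with_shortcut(toy_model, [0.0, 1.0], num_steps=4)
print(result)
```

The only change from a standard Euler sampler for flow matching is that the step size is an input to the model, which is what lets a shortcut-trained network take a few large steps instead of many small ones.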

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques