R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion
Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen

TL;DR
R2-SVC is a novel singing voice conversion framework that enhances robustness to noise and artifacts, improves expressiveness, and achieves state-of-the-art results in real-world noisy environments.
Contribution
R2-SVC introduces simulation-based robustness enhancement, enriched speaker representations, and Neural Source-Filter (NSF) integration for natural and controllable singing voice conversion.
Findings
State-of-the-art performance under noisy conditions
Improved robustness through data augmentation techniques
Enhanced naturalness and expressiveness of converted singing
Abstract
In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency (F0) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate…
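The simulation-based augmentation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `perturb_f0` and `add_echo`, the one-semitone range, and the delay and decay defaults are all assumptions chosen for illustration. It shows one plausible form of a random F0 perturbation (a single random pitch offset applied to the voiced frames of a contour) and a single-tap echo as a crude stand-in for a separation artifact.

```python
import random


def perturb_f0(f0_hz, max_semitones=1.0, rng=None):
    """Shift a whole F0 contour by one random offset, in semitones.

    Assumption: unvoiced frames are encoded as 0 Hz and are left untouched,
    so the voiced/unvoiced pattern is preserved.
    """
    rng = rng or random.Random()
    shift = rng.uniform(-max_semitones, max_semitones)
    ratio = 2.0 ** (shift / 12.0)  # one semitone is a factor of 2^(1/12)
    return [f * ratio if f > 0 else 0.0 for f in f0_hz]


def add_echo(samples, sample_rate, delay_s=0.12, decay=0.4):
    """Mix one delayed, attenuated copy of the signal into itself.

    The output is longer than the input by the delay, so the echo tail
    is not cut off.
    """
    delay = int(round(delay_s * sample_rate))
    out = list(samples) + [0.0] * delay
    for i, s in enumerate(samples):
        out[i + delay] += decay * s
    return out


# Usage: augment a toy contour and a toy waveform.
contour = [0.0, 220.0, 440.0, 0.0]  # Hz, 0 marks unvoiced frames
noisy_contour = perturb_f0(contour, max_semitones=1.0, rng=random.Random(0))
echoed = add_echo([1.0, 0.0, 0.0, 0.0], sample_rate=10, delay_s=0.2, decay=0.5)
```

Because the same ratio scales every voiced frame, the interval between notes is preserved and only the overall pitch moves. A per-frame perturbation or a reverberation simulation would be equally compatible with the abstract's wording; the truncated text does not say which variant the paper uses.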
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
