YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases
Gongyu Chen, Xiaoyu Zhang, Zhenqiang Weng, Junjie Zheng, Da Shen, Chaofan Ding, Wei-Qiang Zhang, Zihao Chen

TL;DR
YingMusic-SVC is a robust zero-shot singing voice conversion framework that improves naturalness, timbre similarity, and intelligibility on real-world songs by combining flow-matching generation, singing-specific inductive biases, and Flow-GRPO reinforcement learning.
Contribution
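The paper introduces YingMusic-SVC, a zero-shot SVC system that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. It adds three singing-specific components: a singing-trained RVC timbre shifter, an F0-aware timbre adaptor, and an energy-balanced rectified flow matching loss.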
Findings
Outperforms strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness on a graded multi-track benchmark.
Maintains high-quality conversion under harmony interference and F0 errors.
Demonstrates effectiveness for real-world singing voice conversion applications.
Abstract
Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied…
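To make the "energy-balanced rectified flow matching loss" mentioned in the abstract more concrete, here is a minimal sketch in plain Python. The rectified flow part is standard: the model regresses the constant velocity x1 - x0 along the straight path x_t = (1 - t) x0 + t x1. The balancing part is an assumption: this summary does not give the paper's weighting, so the inverse-energy per-bin weights below (and all function names) are hypothetical stand-ins, chosen only to illustrate how low-energy high-frequency bins could be kept from being swamped by high-energy ones.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 to data x1 at time t in [0, 1]."""
    return [[(1.0 - t) * a + t * b for a, b in zip(r0, r1)]
            for r0, r1 in zip(x0, x1)]


def target_velocity(x0, x1):
    """Rectified flow regression target: the constant velocity x1 - x0."""
    return [[b - a for a, b in zip(r0, r1)] for r0, r1 in zip(x0, x1)]


def bin_weights(x1, eps=1e-8):
    """ASSUMPTION (not from the paper): weight each frequency bin by the
    inverse of its mean squared value over frames, normalised to mean 1."""
    n_frames, n_bins = len(x1), len(x1[0])
    energy = [sum(x1[f][k] ** 2 for f in range(n_frames)) / n_frames
              for k in range(n_bins)]
    raw = [1.0 / (e + eps) for e in energy]
    mean_raw = sum(raw) / n_bins
    return [w / mean_raw for w in raw]


def weighted_fm_loss(pred_v, x0, x1, weights):
    """Mean of per-bin weighted squared errors between the predicted and
    target velocities. Inputs are frames x bins nested lists."""
    v = target_velocity(x0, x1)
    total, count = 0.0, 0
    for pred_row, v_row in zip(pred_v, v):
        for k, (p, q) in enumerate(zip(pred_row, v_row)):
            total += weights[k] * (p - q) ** 2
            count += 1
    return total / count


# Toy example: 2 frames x 2 bins, where bin 1 carries 4x the energy of bin 0.
x0 = [[0.0, 0.0], [0.0, 0.0]]   # "noise" endpoint (zeros for readability)
x1 = [[1.0, 2.0], [1.0, 2.0]]   # "data" endpoint
w = bin_weights(x1)             # the lower-energy bin gets the larger weight
x_half = interpolate(x0, x1, 0.5)
perfect = target_velocity(x0, x1)
loss_perfect = weighted_fm_loss(perfect, x0, x1, w)
```

In a real system the predicted velocity would come from a neural network evaluated at `interpolate(x0, x1, t)` for a randomly sampled t, and the tensors would be batched spectrogram features rather than nested lists.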
Taxonomy
Topics: Music Technology and Sound Studies · Speech Recognition and Synthesis · Music and Audio Processing
