YingMusic-SVC: Real-World Robust Zero-Shot Singing Voice Conversion with Flow-GRPO and Singing-Specific Inductive Biases

Gongyu Chen; Xiaoyu Zhang; Zhenqiang Weng; Junjie Zheng; Da Shen; Chaofan Ding; Wei-Qiang Zhang; Zihao Chen

arXiv:2512.04793 · cs.SD · December 5, 2025

Open Access

TL;DR

YingMusic-SVC is a robust zero-shot singing voice conversion framework that improves naturalness, timbre similarity, and intelligibility in real-world scenarios by integrating flow-based models, singing-specific biases, and reinforcement learning.

Contribution

The paper introduces YingMusic-SVC, a zero-shot SVC system that combines singing-specific inductive biases with continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning to improve real-world robustness.

Findings

01. Outperforms open-source baselines in timbre similarity and naturalness.

02. Maintains high-quality conversion under harmony interference and F0 errors.

03. Demonstrates effectiveness for real-world singing voice conversion applications.

Abstract

Singing voice conversion (SVC) aims to render the target singer's timbre while preserving melody and lyrics. However, existing zero-shot SVC systems remain fragile in real songs due to harmony interference, F0 errors, and the lack of inductive biases for singing. We propose YingMusic-SVC, a robust zero-shot framework that unifies continuous pre-training, robust supervised fine-tuning, and Flow-GRPO reinforcement learning. Our model introduces a singing-trained RVC timbre shifter for timbre-content disentanglement, an F0-aware timbre adaptor for dynamic vocal expression, and an energy-balanced rectified flow matching loss to enhance high-frequency fidelity. Experiments on a graded multi-track benchmark show that YingMusic-SVC achieves consistent improvements over strong open-source baselines in timbre similarity, intelligibility, and perceptual naturalness, especially under accompanied…
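
For readers unfamiliar with the rectified flow matching objective the abstract mentions, the following is a minimal NumPy sketch of the standard (unweighted) loss, with an optional per-frequency-bin weight. The function name `rectified_flow_loss`, the `velocity_fn` callable, the tensor layout, and the `bin_weights` argument are all assumptions made for illustration. In particular, the weighting is only a hypothetical stand-in for the idea of rebalancing frequency bands; the abstract does not specify the paper's energy-balanced formulation, and this sketch should not be read as that method.

```python
import numpy as np


def rectified_flow_loss(velocity_fn, x1, rng, bin_weights=None):
    """Standard rectified flow matching loss on a batch of spectrograms.

    velocity_fn: callable (x_t, t) -> predicted velocity, same shape as x_t
                 (hypothetical model interface, not the paper's API).
    x1:          target data, assumed shape (batch, n_bins, n_frames).
    rng:         a numpy.random.Generator.
    bin_weights: optional per-frequency-bin weights of shape (n_bins,).
                 Illustrative assumption only; NOT the paper's
                 energy-balanced loss.
    """
    # Sample a Gaussian noise endpoint and one time step per example.
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform(size=(x1.shape[0], 1, 1))

    # Rectified flow uses a straight-line path between noise and data.
    x_t = (1.0 - t) * x0 + t * x1

    # Along that path the target velocity is constant: x1 - x0.
    target = x1 - x0
    sq_err = (velocity_fn(x_t, t) - target) ** 2

    if bin_weights is not None:
        w = np.asarray(bin_weights, dtype=float)
        w = w / w.mean()  # normalise so the overall loss scale is unchanged
        sq_err = sq_err * w[None, :, None]

    return float(sq_err.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mel = rng.standard_normal((2, 80, 100))  # toy batch, not real audio

    def zero_velocity(x_t, t):
        return np.zeros_like(x_t)  # placeholder for a trained network

    # Hypothetical weights that emphasise higher-frequency bins.
    weights = np.linspace(1.0, 2.0, 80)
    print(rectified_flow_loss(zero_velocity, mel, rng, bin_weights=weights))
```

Passing `bin_weights=None` recovers the plain objective. Any non-uniform weighting changes which frequency bands dominate the gradient, which is the general motivation for band-aware losses when high-frequency detail is underfit.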

Taxonomy

Topics: Music Technology and Sound Studies · Speech Recognition and Synthesis · Music and Audio Processing