Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
Chen Geng, Meng Chen, Ruohua Zhou, Ruolan Liu, Weifeng Zhao

TL;DR
Poly-SVC is a singing voice conversion system that uses a harmonic-aware approach to process the residual harmonies left in separated vocals, outperforming baseline models in naturalness and timbre similarity.
Contribution
It introduces a zero-shot, cross-lingual SVC system with a CQT-based pitch extractor and a diffusion decoder, capable of handling residual harmonies in polyphonic recordings.
Findings
Poly-SVC outperforms baseline models in naturalness and timbre similarity.
It effectively reconstructs harmonies in both harmony-rich and single-melody recordings.
The system demonstrates superior harmony preservation compared to existing methods.
Abstract
Singing Voice Conversion (SVC) aims to transform a source singing voice into the voice of a target singer while preserving lyrics and melody. Most existing SVC methods depend on F0 extractors to capture the lead melody from clean vocals. However, no existing method can reliably extract clean vocals from accompanied recordings without leaving residual harmonies behind. In this paper, we propose Poly-SVC, a zero-shot, cross-lingual singing voice conversion system designed to process residual harmonies. Poly-SVC is composed of three key components: a Constant-Q Transform (CQT)-based pitch extractor that preserves both the lead melody and the residual harmony, a random sampler that reduces interfering information from the CQT, and a diffusion decoder based on Conditional Flow Matching (CFM) that fuses pitch, content, and timbre features into natural-sounding polyphonic outputs. Experiments…
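The contrast the abstract draws between a single-F0 extractor and a CQT-based pitch representation can be illustrated with a small sketch. This is not the Poly-SVC extractor, whose details the abstract does not give. It is a naive one-frame constant-Q transform in NumPy, applied to a synthetic two-note signal. The function names (`cqt_frame`, `spectral_peaks`), the sample rate, the bin layout, the note choices, and the peak threshold are all assumptions made for illustration.

```python
import numpy as np

def cqt_frame(signal, sr, f_min=110.0, n_bins=36, bins_per_octave=12):
    """Naive constant-Q magnitudes for a single frame starting at sample 0.

    Illustrative only: one geometrically spaced bin per semitone, with a
    window length inversely proportional to the bin's center frequency.
    """
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # constant Q factor
    freqs = f_min * 2.0 ** (np.arange(n_bins) / bins_per_octave)
    mags = np.empty(n_bins)
    for k, f_k in enumerate(freqs):
        n_k = int(np.ceil(q * sr / f_k))               # window length for bin k
        n = np.arange(n_k)
        window = np.hanning(n_k)
        kernel = window * np.exp(-2j * np.pi * f_k * n / sr)
        mags[k] = np.abs(np.dot(signal[:n_k], kernel)) / window.sum()
    return freqs, mags

def spectral_peaks(mags, rel_threshold=0.4):
    """Indices of local maxima above a fraction of the largest magnitude."""
    floor = rel_threshold * mags.max()
    return [k for k in range(1, len(mags) - 1)
            if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] > mags[k + 1]]

# Synthetic "lead plus residual harmony": two sinusoids a fifth apart (assumed).
sr = 16000
t = np.arange(sr) / sr
lead_hz = 220.0
harmony_hz = 110.0 * 2.0 ** (19 / 12)
signal = np.sin(2 * np.pi * lead_hz * t) + 0.6 * np.sin(2 * np.pi * harmony_hz * t)

freqs, mags = cqt_frame(signal, sr)
peak_bins = spectral_peaks(mags)     # pitch candidates kept by the CQT view
f0_only_bin = int(np.argmax(mags))   # what a single-pitch pick would keep

print("CQT peaks (Hz):", [round(float(freqs[k]), 1) for k in peak_bins])
print("Single-pitch pick (Hz):", round(float(freqs[f0_only_bin]), 1))
```

Under these assumptions the CQT frame shows a peak for each note, while the single strongest bin reports only the lead. That is the information a lead-melody F0 track would discard and that a polyphony-aware pitch feature is meant to retain.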
