VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling
Ziqian Ning, Yuepeng Jiang, Zhichao Wang, Bin Zhang, Lei Xie

TL;DR
This paper presents a VITS-based singing voice conversion system that leverages Whisper for feature extraction and multi-scale F0 modeling, achieving top performance in the Singing Voice Conversion Challenge 2023.
Contribution
The system introduces a novel combination of Whisper-based bottleneck features and multi-scale F0 modeling within a VITS framework, along with a three-stage training strategy for limited data adaptation.
Findings
Achieved top rankings for naturalness in the Singing Voice Conversion Challenge 2023
Effective removal of source speaker timbre through pitch perturbation
Ablation studies confirm the effectiveness of each system component
Abstract
This paper introduces the T23 team's system submitted to the Singing Voice Conversion Challenge 2023. Following the recognition-synthesis framework, our singing conversion model is based on VITS, incorporating four key modules: a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. We particularly leverage Whisper, a powerful pre-trained ASR model, to extract bottleneck features (BNF) as the input of the prior encoder. Before BNF extraction, we apply pitch perturbation to the source signal to remove speaker timbre, which effectively avoids leakage of the source speaker's timbre into the target. Moreover, the PBTC module extracts multi-scale F0 as the auxiliary input to the prior encoder, thereby better capturing the pitch variations of singing. We design a three-stage training strategy to better adapt the base model to the target…
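The idea behind a parallel bank of transposed convolutions can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the scalar F0 input, the stride, and the fixed all-ones kernels are all assumptions chosen for illustration (in a trained model the kernels would be learned and the input would typically be an embedded feature sequence). Each branch upsamples the same contour with a different kernel size, so each sees a different temporal scale, and the branch outputs are cropped to a common length and summed.

```python
from typing import List, Sequence


def transposed_conv1d(x: Sequence[float], kernel: Sequence[float], stride: int) -> List[float]:
    """1-D transposed convolution without padding.

    Every input sample adds a scaled copy of the kernel to the output,
    with consecutive copies placed `stride` positions apart.
    """
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for k, weight in enumerate(kernel):
            out[i * stride + k] += value * weight
    return out


def parallel_bank(x: Sequence[float], kernels: Sequence[Sequence[float]], stride: int) -> List[float]:
    """Hypothetical PBTC-style bank: run all branches on x and sum them.

    Branch outputs are center-cropped to len(x) * stride so that branches
    with different kernel sizes stay time-aligned before summation.
    """
    out_len = len(x) * stride
    total = [0.0] * out_len
    for kernel in kernels:
        if len(kernel) < stride:
            raise ValueError("kernel must be at least as long as the stride")
        y = transposed_conv1d(x, kernel, stride)
        offset = (len(y) - out_len) // 2  # center crop
        for t in range(out_len):
            total[t] += y[offset + t]
    return total


# Illustrative input: two made-up coarse F0 values, upsampled by a factor of 2.
coarse_f0 = [1.0, 2.0]
small_scale = [1.0, 1.0]            # short kernel: local detail
large_scale = [1.0, 1.0, 1.0, 1.0]  # longer kernel: smoother, wider context
multi_scale = parallel_bank(coarse_f0, [small_scale, large_scale], stride=2)
```

The short kernel simply repeats each value, while the longer kernel blends neighbouring values, so the sum carries both fine and smoothed pitch movement. How the real system embeds F0 and feeds the result to the prior encoder is not specified in this summary.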
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Balanced Selection
