VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling

Ziqian Ning; Yuepeng Jiang; Zhichao Wang; Bin Zhang; Lei Xie

arXiv:2310.02802·eess.AS·October 5, 2023·ASRU

TL;DR

This paper presents a VITS-based singing voice conversion system that uses Whisper for feature extraction together with multi-scale F0 modeling, and that ranked among the top systems for naturalness in the Singing Voice Conversion Challenge 2023.

Contribution

The system introduces a novel combination of Whisper-based bottleneck features and multi-scale F0 modeling within a VITS framework, along with a three-stage training strategy for limited data adaptation.
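To make the idea of multi-scale F0 modeling more concrete, here is a minimal pure-Python sketch of a parallel bank of 1-D transposed convolutions applied to an F0 contour. It is an illustration only, not the authors' module: the kernel sizes, the stride of 1, the uniform (unlearned) kernel weights, the center cropping, and the function names `transposed_conv1d`, `center_crop`, and `multi_scale_f0` are all assumptions, since this page does not describe how the F0 is encoded or how the branches are configured.

```python
from typing import List


def transposed_conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Stride-1 transposed 1-D convolution.

    Each input sample scatters a scaled copy of the kernel into the output,
    so the output has len(signal) + len(kernel) - 1 samples.
    """
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, x in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += x * w
    return out


def center_crop(seq: List[float], length: int) -> List[float]:
    """Trim both ends so that every branch matches the input length."""
    start = (len(seq) - length) // 2
    return seq[start:start + length]


def multi_scale_f0(f0: List[float], kernel_sizes: List[int]) -> List[List[float]]:
    """Run one transposed convolution per kernel size and stack the results.

    Assumption: uniform placeholder kernels. A trained model would learn
    these weights. Frames near the edges see implicit zeros, so their
    values are attenuated for kernel sizes above 1.
    """
    features = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k
        features.append(center_crop(transposed_conv1d(f0, kernel), len(f0)))
    return features


# Hypothetical F0 contour in Hz: an octave jump from 220 Hz to 440 Hz.
f0_hz = [220.0, 220.0, 220.0, 440.0, 440.0, 440.0]
bank = multi_scale_f0(f0_hz, kernel_sizes=[1, 3, 5])
```

In this sketch, the branch with kernel size 1 reproduces the contour unchanged, while the wider kernels spread the octave jump over more frames. Stacking the branches gives each frame a view of the pitch at several temporal scales, which is the general motivation for feeding multi-scale F0 to the prior encoder.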

Findings

01. Achieved top rankings in naturalness in the Singing Voice Conversion Challenge 2023
02. Pitch perturbation effectively removes the source speaker's timbre
03. Ablation studies confirm the effectiveness of each system component

Abstract

This paper introduces the T23 team's system submitted to the Singing Voice Conversion Challenge 2023. Following the recognition-synthesis framework, our singing conversion model is based on VITS, incorporating four key modules: a prior encoder, a posterior encoder, a decoder, and a parallel bank of transposed convolutions (PBTC) module. We particularly leverage Whisper, a powerful pre-trained ASR model, to extract bottleneck features (BNF) as the input of the prior encoder. Before BNF extraction, we perform pitch perturbation to the source signal to remove speaker timbre, which effectively avoids the leakage of the source speaker timbre to the target. Moreover, the PBTC module extracts multi-scale F0 as the auxiliary input to the prior encoder, thereby capturing better pitch variations of singing. We design a three-stage training strategy to better adapt the base model to the target…
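The pitch perturbation step mentioned in the abstract can be illustrated by the arithmetic behind a pitch shift. The sketch below only computes the frequency ratio for a random shift and applies it to a list of frequencies. It does not process a waveform, and the abstract does not state the shift range or the shifting method, so the 3-semitone limit and the names `semitone_ratio` and `shift_frequencies` are assumptions made for illustration.

```python
import random
from typing import List, Optional, Tuple


def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a shift of the given number of semitones
    in twelve-tone equal temperament (12 semitones = one octave = 2x)."""
    return 2.0 ** (semitones / 12.0)


def shift_frequencies(
    freqs_hz: List[float],
    max_semitones: float = 3.0,  # assumed range, not taken from the paper
    rng: Optional[random.Random] = None,
) -> Tuple[float, List[float]]:
    """Draw one random shift and scale every frequency by the same ratio.

    Unvoiced frames encoded as 0.0 stay at 0.0 because 0.0 * ratio == 0.0.
    """
    rng = rng or random.Random()
    shift = rng.uniform(-max_semitones, max_semitones)
    ratio = semitone_ratio(shift)
    return shift, [f * ratio for f in freqs_hz]


# Hypothetical contour in Hz with one unvoiced frame.
shift, shifted = shift_frequencies([220.0, 0.0, 440.0], rng=random.Random(0))
```

Because a single ratio scales the whole utterance, the melody's relative intervals are preserved while the absolute pitch moves. Applying such a shift to real audio would require a pitch-shifting routine (for example resampling or a vocoder), which is outside this sketch.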

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Balanced Selection