VITS-based Singing Voice Conversion System with DSPGAN post-processing for SVCC2023

Yiquan Zhou; Meng Chen; Yi Lei; Jihua Zhu; Weifeng Zhao

arXiv:2310.05118 · cs.SD · October 10, 2023 · 1 citation

Open Access

TL;DR

This paper introduces a VITS-based singing voice conversion system with DSPGAN post-processing. The system combines a feature extractor, a voice converter, and a vocoder-based post-processor, and ranked 1st in naturalness and 2nd in similarity in SVCC2023.

Contribution

The paper presents a novel SVCC2023 system integrating a VITS model with DSPGAN for improved audio quality and a two-stage training strategy for limited data adaptation.

Findings

01. Achieved 1st in naturalness and 2nd in similarity in SVCC2023.

02. Effective use of DSPGAN for waveform synthesis.

03. Two-stage training improves target speaker adaptation.

Abstract

This paper presents the T02 team's system for the Singing Voice Conversion Challenge 2023 (SVCC2023). Our system entails a VITS-based SVC model, incorporating three modules: a feature extractor, a voice converter, and a post-processor. Specifically, the feature extractor provides F0 contours and extracts speaker-independent linguistic content from the input singing voice by leveraging a HuBERT model. The voice converter is employed to recompose the speaker timbre, F0, and linguistic content to generate the waveform of the target speaker. Besides, to further improve the audio quality, a fine-tuned DSPGAN vocoder is introduced to re-synthesise the waveform. Given the limited target speaker data, we utilize a two-stage training strategy to adapt the base model to the target speaker. During model adaptation, several tricks, such as data augmentation and joint training with auxiliary singer…
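The three-module data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `SVCPipeline`, `Features`, `HOP`, and every `toy_*` function are hypothetical, and the toy stand-ins only mimic the shape of the data that a real F0 estimator, HuBERT content encoder, VITS-based converter, and DSPGAN vocoder would exchange.

```python
from dataclasses import dataclass
from typing import Callable, List

Waveform = List[float]

HOP = 4  # hypothetical frame hop in samples, chosen only for this toy example


@dataclass
class Features:
    f0: List[float]             # per-frame F0 contour in Hz
    content: List[List[float]]  # per-frame speaker-independent content vectors


@dataclass
class SVCPipeline:
    """Hypothetical wiring of the three modules named in the abstract."""
    extract_f0: Callable[[Waveform], List[float]]             # feature extractor: F0
    extract_content: Callable[[Waveform], List[List[float]]]  # feature extractor: content (HuBERT in the paper)
    convert: Callable[[Features, int], Waveform]              # voice converter (VITS-based in the paper)
    post_process: Callable[[Waveform], Waveform]              # post-processor (DSPGAN in the paper)

    def run(self, source: Waveform, target_speaker_id: int) -> Waveform:
        f0 = self.extract_f0(source)
        content = self.extract_content(source)
        if len(f0) != len(content):
            raise ValueError("F0 and content must have the same number of frames")
        coarse = self.convert(Features(f0, content), target_speaker_id)
        return self.post_process(coarse)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_f0(wave: Waveform) -> List[float]:
    # Constant 100 Hz per frame; a real system would estimate pitch.
    return [100.0 for _ in range(0, len(wave), HOP)]


def toy_content(wave: Waveform) -> List[List[float]]:
    # One-dimensional "content" per frame: the frame mean.
    frames = [wave[i:i + HOP] for i in range(0, len(wave), HOP)]
    return [[sum(frame) / len(frame)] for frame in frames]


def toy_convert(features: Features, speaker_id: int) -> Waveform:
    # Pretend the speaker identity is a simple gain applied to the content.
    gain = 1.0 + 0.1 * speaker_id
    out: Waveform = []
    for frame in features.content:
        out.extend([frame[0] * gain] * HOP)
    return out


def toy_post(wave: Waveform) -> Waveform:
    # Stand-in for vocoder re-synthesis: clip samples to [-1, 1].
    return [max(-1.0, min(1.0, s)) for s in wave]


pipeline = SVCPipeline(toy_f0, toy_content, toy_convert, toy_post)
converted = pipeline.run([0.5] * 8, target_speaker_id=1)
```

Keeping each module behind a plain callable mirrors the separation the abstract describes: the post-processor only sees the converter's waveform, so a vocoder could in principle be swapped or fine-tuned without touching the feature extractor.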

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing