A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023

Ryuichi Yamamoto; Reo Yoneyama; Lester Phillip Violeta; Wen-Chin Huang; Tomoki Toda

arXiv:2310.05203·eess.AS·October 10, 2023

TL;DR

This paper introduces the T13 voice conversion systems for SVCC 2023, which use a diffusion-based recognition-synthesis approach pretrained on large-scale speech and singing data and achieve competitive naturalness and speaker similarity on the cross-domain singing voice conversion task (Task 2).

Contribution

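The paper presents a diffusion-based any-to-any voice conversion model that is pretrained on 750 hours of speech and singing data and then fine-tuned for each target singer/speaker, which makes conversion data-efficient with only 150 to 160 target utterances. The listening-test results suggest that it generalizes across the speech and singing domains.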

Findings

01 Large-scale training improves cross-domain SVC performance.

02 The system achieves competitive naturalness and speaker similarity on the cross-domain task (Task 2).

03 Large datasets benefit objective evaluation metrics.

Abstract

This paper presents our systems (denoted as T13) for the Singing Voice Conversion Challenge (SVCC) 2023. For both the in-domain and cross-domain English singing voice conversion (SVC) tasks (Task 1 and Task 2), we adopt a recognition-synthesis approach with self-supervised learning-based representation. To achieve data-efficient SVC with a limited amount of target singer/speaker data (150 to 160 utterances for SVCC 2023), we first train a diffusion-based any-to-any voice conversion model on a large-scale collection of 750 hours of publicly available speech and singing data. Then, we fine-tune the model for each target singer/speaker of Task 1 and Task 2. Large-scale listening tests conducted by SVCC 2023 show that our T13 system achieves competitive naturalness and speaker similarity for the harder cross-domain SVC (Task 2), which suggests that our proposed method generalizes well. Our objective…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing