Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher

Heyang Xue; Xinsheng Wang; Yongmao Zhang; Lei Xie; Pengcheng Zhu; Mengxiao Bi

arXiv:2203.16408·cs.SD·May 27, 2022

TL;DR

Learn2Sing 2.0 introduces a diffusion-based method with mutual information constraints to synthesize high-quality singing voices for target speakers without requiring their singing data, by leveraging data from singing teachers.

Contribution

It presents a novel diffusion and mutual information-based framework that enables target speaker singing voice synthesis without individual singing data, improving flexibility and quality.

Findings

01. Capable of synthesizing high-quality singing voices with only 10 decoding steps.

02. Effective separation of speaker and style information during training.

03. Achieves realistic singing synthesis for unseen speakers.

Abstract

Building a high-quality singing corpus for a person who is not good at singing is non-trivial, which makes it challenging to create a singing voice synthesizer for this person. Learn2Sing is dedicated to synthesizing the singing voice of a speaker without his or her singing data by learning from data recorded by others, i.e., the singing teacher. Inspired by the fact that pitch is the key style factor that distinguishes singing from speaking voice, the proposed Learn2Sing 2.0 first generates a preliminary acoustic feature with the pitch value averaged at the phone level, which allows the training of this process for different styles, i.e., speaking or singing, to share the same conditions except for the speaker information. Then, conditioned on the specific style, a diffusion decoder, which is accelerated by a fast sampling algorithm during the inference stage, is adopted to gradually restore the…
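The phone-level pitch averaging mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `phone_level_average_pitch`, the use of 0 to mark unvoiced frames, and the choice to exclude unvoiced frames from each mean are all assumptions made for illustration. It takes a frame-level F0 contour and per-phone frame counts, and replaces every frame's F0 with the mean voiced F0 of the phone it belongs to.

```python
import numpy as np


def phone_level_average_pitch(f0, durations):
    """Replace each frame's F0 with the mean voiced F0 of its phone.

    Hypothetical helper for illustration only.
    f0:        frame-level F0 values in Hz; 0 marks an unvoiced frame (assumption).
    durations: number of frames covered by each phone, in order.
    """
    f0 = np.asarray(f0, dtype=float)
    if sum(durations) != len(f0):
        raise ValueError("durations must sum to the number of F0 frames")

    averaged = np.zeros_like(f0)
    start = 0
    for n_frames in durations:
        segment = f0[start:start + n_frames]
        voiced = segment[segment > 0]  # ignore unvoiced frames in the mean
        # A fully unvoiced phone keeps an F0 of 0.
        averaged[start:start + n_frames] = voiced.mean() if voiced.size else 0.0
        start += n_frames
    return averaged


# Two phones of three frames each; the first frame is unvoiced.
f0 = np.array([0.0, 200.0, 220.0, 300.0, 310.0, 320.0])
durations = [3, 3]
averaged = phone_level_average_pitch(f0, durations)
print(averaged)
```

Averaging in this way removes the fine pitch movement inside each phone, which is the style cue the abstract identifies as separating singing from speech, so the same coarse condition can be used for both kinds of data.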

Code & Models

Repositories

WelkinYang/Learn2Sing2.0/tree/main/code
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing