Learn2Sing: Target Speaker Singing Voice Synthesis by learning from a Singing Teacher
Heyang Xue, Shan Yang, Yi Lei, Lei Xie, Xiulin Li

TL;DR
Learn2Sing synthesizes a target speaker's singing voice without any singing data from that speaker, by training on a singing teacher's corpus together with the target speakers' speech.
Contribution
The paper introduces a method that synthesizes a target speaker's singing voice from that speaker's speech data and a singing teacher's corpus, eliminating the need for singing recordings from the target speaker.
Findings
Effective synthesis of target speaker singing voice from speech data.
Disentanglement of singing style and speaker identity via domain adversarial training.
Successful application without target speaker's singing recordings.
Abstract
Singing voice synthesis has received growing attention with the rapid development of speech synthesis. In general, a studio-level singing corpus is necessary to produce a natural singing voice from lyrics and music-related transcription. However, such a corpus is difficult to collect, since it is hard for many of us to sing like a professional singer. In this paper, we propose an approach, Learn2Sing, that needs only a singing teacher to generate the target speakers' singing voice without their singing voice data. In our approach, a teacher's singing corpus and speech from multiple target speakers are used to train a frame-level auto-regressive acoustic model in which singing and speaking share a common speaker embedding and style tag embedding. Meanwhile, since there is no music-related transcription for the target speaker, we use log-scale fundamental frequency (LF0) as an…
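For readers unfamiliar with LF0, the following is a minimal sketch of converting a frame-level F0 contour into log-scale F0. It is not taken from the paper: the function name `f0_to_lf0`, the convention that unvoiced frames are marked with 0 Hz, and the separate voiced/unvoiced flag are all assumptions made for illustration.

```python
import math


def f0_to_lf0(f0_hz, unvoiced_value=0.0):
    """Convert a frame-level F0 contour (Hz) to log-scale F0 plus a voiced flag.

    Assumption: frames with F0 <= 0 are unvoiced. They receive a placeholder
    value in the LF0 track and a 0 in the voiced-flag track.
    """
    lf0 = []
    voiced = []
    for f0 in f0_hz:
        if f0 > 0.0:
            lf0.append(math.log(f0))  # natural log of F0 in Hz
            voiced.append(1)
        else:
            lf0.append(unvoiced_value)
            voiced.append(0)
    return lf0, voiced


# Hypothetical four-frame contour: unvoiced, 220 Hz, 440 Hz, unvoiced.
contour = [0.0, 220.0, 440.0, 0.0]
lf0, voiced = f0_to_lf0(contour)
```

On the log scale, an octave (a doubling of F0) becomes a constant offset of ln 2 regardless of the absolute pitch, which is one reason log F0 is a common pitch representation in speech and singing synthesis.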
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
