MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance
Semin Kim, Myeonghun Jeong, Hyeonseung Lee, Minchan Kim, Byoung Jin Choi, Nam Soo Kim

TL;DR
MakeSinger is a semi-supervised training method for diffusion-based singing voice synthesis that uses unlabeled data to improve voice quality and can synthesize the singing voices of TTS speakers without any singing data from those speakers.
Contribution
The paper presents a novel semi-supervised training approach for SVS using classifier-free diffusion guidance, enabling effective use of unlabeled data and cross-domain voice synthesis.
Findings
Semi-supervised training outperforms supervised baselines in quality and accuracy.
The method can synthesize TTS speakers' singing voices without their singing data.
The improvements cover pronunciation, pitch accuracy, and overall voice quality.
Abstract
In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of the diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism gives text and pitch guidance on the reverse diffusion step by estimating the score of masked input. Experimental results show that the model trained in a semi-supervised manner outperforms other baselines trained only on the labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in…
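The abstract describes training with masked conditions and guiding the reverse diffusion step with both text and pitch. The Python sketch below illustrates one common way such a scheme can be written. It is not the paper's formulation: the additive two-term guidance rule, the names mask_conditions, dual_guided_noise, and toy_denoiser, the denoiser signature, and the weights w_text and w_pitch are all assumptions made for illustration.

```python
import random
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical denoiser signature (an assumption, not the paper's API):
# (noisy input, text or None, pitch or None) -> predicted noise
Denoiser = Callable[
    [Vector, Optional[Sequence[int]], Optional[Sequence[float]]], Vector
]


def mask_conditions(text, pitch, p_drop=0.1, rng=random):
    """Training side: randomly drop each available condition.

    Unlabeled samples simply arrive with text and/or pitch already None,
    so labeled and unlabeled data can share one training loop.
    """
    if text is not None and rng.random() < p_drop:
        text = None
    if pitch is not None and rng.random() < p_drop:
        pitch = None
    return text, pitch


def dual_guided_noise(denoiser, x_t, text, pitch, w_text=2.0, w_pitch=2.0):
    """Inference side: combine three noise estimates with two guidance weights.

    Assumed rule (generic two-condition classifier-free guidance):
    eps = eps(none) + w_text * (eps(text) - eps(none))
                    + w_pitch * (eps(pitch) - eps(none))
    """
    eps_none = denoiser(x_t, None, None)    # both conditions masked
    eps_text = denoiser(x_t, text, None)    # pitch masked
    eps_pitch = denoiser(x_t, None, pitch)  # text masked
    return [
        e0 + w_text * (et - e0) + w_pitch * (ep - e0)
        for e0, et, ep in zip(eps_none, eps_text, eps_pitch)
    ]


def toy_denoiser(x_t, text, pitch):
    """Stand-in for a trained network, used only to exercise the functions."""
    shift = (1.0 if text is not None else 0.0) + (0.5 if pitch is not None else 0.0)
    return [x + shift for x in x_t]


if __name__ == "__main__":
    guided = dual_guided_noise(toy_denoiser, [0.0, 1.0], [1, 2], [440.0])
    print(guided)
```

With both weights set to 0 the sketch reduces to the fully masked estimate, and raising either weight pushes the sample toward the corresponding condition. Separate weights are used here so that text and pitch adherence can be tuned independently, which is a design choice of this sketch.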
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Diffusion
