MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance

Semin Kim; Myeonghun Jeong; Hyeonseung Lee; Minchan Kim; Byoung Jin Choi; Nam Soo Kim

arXiv:2406.05965 · eess.AS · June 11, 2024

TL;DR

MakeSinger introduces a semi-supervised diffusion-based method for singing voice synthesis that leverages unlabeled data to improve voice quality and can synthesize singing voices from TTS data without additional singing samples.

Contribution

The paper presents a novel semi-supervised training approach for SVS using classifier-free diffusion guidance, enabling effective use of unlabeled data and cross-domain voice synthesis.
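One way to picture how unlabeled data can enter training under classifier-free guidance is condition masking: a missing text or pitch label is replaced by a mask, and available labels are also dropped at random so the same network learns conditional and unconditional scores. The sketch below illustrates that idea only. The function name `mask_conditions`, the `MASK` placeholder, and the drop probability `p_drop` are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical placeholder standing in for a learned "masked" embedding.
MASK = None


def mask_conditions(text, pitch, p_drop=0.1, rng=random):
    """Return the (text, pitch) conditions fed to the denoiser for one sample.

    Labels that are missing (None) stay masked. Labels that are present are
    additionally dropped with probability p_drop (an assumed value), so the
    model also sees unconditional and partially conditioned inputs.
    """
    if text is not None and rng.random() < p_drop:
        text = MASK
    if pitch is not None and rng.random() < p_drop:
        pitch = MASK
    return text, pitch


# Unlabeled audio: both conditions are already masked.
print(mask_conditions(None, None))
# Labeled singing sample: conditions survive unless randomly dropped.
print(mask_conditions("la la", [440.0, 494.0], p_drop=0.0))
```

Under this scheme a fully labeled sample, a sample with only text, and raw audio with no labels all pass through the same training step; only the mask pattern differs.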

Findings

01. Semi-supervised training outperforms supervised baselines in quality and accuracy.

02. The method can synthesize TTS speakers' singing voices without their singing data.

03. Improved pronunciation, pitch accuracy, and overall voice quality.

Abstract

In this paper, we propose MakeSinger, a semi-supervised training method for singing voice synthesis (SVS) via classifier-free diffusion guidance. The challenge in SVS lies in the costly process of gathering aligned sets of text, pitch, and audio data. MakeSinger enables the training of a diffusion-based SVS model from any speech and singing voice data regardless of its labeling, thereby enhancing the quality of generated voices with a large amount of unlabeled data. At inference, our novel dual guiding mechanism gives text and pitch guidance on the reverse diffusion step by estimating the score of the masked input. Experimental results show that the model trained in a semi-supervised manner outperforms other baselines trained only on the labeled data in terms of pronunciation, pitch accuracy, and overall quality. Furthermore, we demonstrate that by adding Text-to-Speech (TTS) data in…
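The dual guiding mechanism mentioned in the abstract can be illustrated with a small sketch. The paper's exact formulation is not reproduced on this page, so the additive combination below is an assumption borrowed from the usual multi-condition form of classifier-free guidance: start from the score with both conditions masked, then add a weighted text direction and a weighted pitch direction. The function name `dual_guided_score` and the weights `w_text` and `w_pitch` are hypothetical.

```python
def dual_guided_score(s_uncond, s_text, s_pitch, w_text=1.0, w_pitch=1.0):
    """Combine three score estimates elementwise (assumed additive form).

    s_uncond: score with text and pitch both masked
    s_text:   score with text given and pitch masked
    s_pitch:  score with pitch given and text masked
    w_text, w_pitch: guidance weights (hypothetical names and defaults)
    """
    return [
        u + w_text * (t - u) + w_pitch * (p - u)
        for u, t, p in zip(s_uncond, s_text, s_pitch)
    ]


# Toy two-dimensional scores; real scores would be mel-spectrogram shaped.
guided = dual_guided_score(
    s_uncond=[0.0, 1.0],
    s_text=[1.0, 1.0],
    s_pitch=[0.0, 3.0],
    w_text=2.0,
    w_pitch=0.5,
)
print(guided)  # [2.0, 2.0]
```

Setting both weights to zero recovers the unconditional score, and separate weights let text and pitch guidance be tuned independently at each reverse diffusion step.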

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Diffusion