A Melody-Unsupervision Model for Singing Voice Synthesis

Soonbeom Choi; Juhan Nam

arXiv:2110.06546·eess.AS·April 15, 2022

Open Access

TL;DR

This paper introduces a melody-unsupervised singing voice synthesis model that trains on audio-and-lyrics pairs without temporal alignment labels, generates singing voice from a melody and lyrics at inference time, and can be fine-tuned with varying amounts of temporally aligned melody supervision.

Contribution

It presents a novel end-to-end model that reduces the need for manual alignment in training and can be trained with speech data to generate singing voices.

Findings

01

The model achieves comparable audio quality in semi-supervised settings.

02

It can be trained on speech audio and text labels and still generate singing voices at inference time.

03

Fine-tuning with different supervision levels improves performance.

Abstract

Recent studies in singing voice synthesis have achieved high-quality results by leveraging advances in text-to-speech models based on deep neural networks. One of the main issues in training singing voice synthesis models is that they require melody and lyric labels to be temporally aligned with the audio data. Producing this temporal alignment is time-consuming manual work when preparing training data. To address the issue, we propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment at training time but generates singing voice audio given a melody and lyrics input at inference time. The proposed model is composed of a phoneme classifier and a singing voice generator jointly trained in an end-to-end manner. The model can be fine-tuned by adjusting the amount of supervision with temporally aligned melody labels. Through experiments in…

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing