KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms

Chien-Feng Liao; Jen-Yu Liu; Yi-Hsuan Yang

arXiv:2110.04005 · eess.AS · October 11, 2021 · 1 cites

TL;DR

KaraSinger is a novel score-free singing voice synthesis model that uses VQ-VAE and language modeling to generate singing without predefined scores, demonstrating high quality and intelligibility.

Contribution

It introduces a lightweight VQ-VAE and language model architecture for score-free SVS, enabling spontaneous melody and prosody generation from lyrics.

Findings

1. Achieved high intelligibility and musicality in listening tests.

2. Validated effectiveness on a proprietary dataset of 550 English pop songs.

3. Fast training and inference due to lightweight architecture.

Abstract

In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention for learning a robust alignment between the input phoneme sequence and the output discrete code. We keep the architecture of both the VQ-VAE and LM light-weight for fast training and inference speed. We validate the…
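
The abstract describes compressing Mel-spectrograms into sequences of discrete codes. As a rough illustration of the quantization step that a VQ-VAE generally relies on, the sketch below maps each feature frame to the index of its nearest codebook vector. It is not the authors' implementation: the function names, the two-dimensional toy frames, and the three-entry codebook are all hypothetical, and a real model would learn both the encoder and the codebook.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Replace each frame with the index of its nearest codebook entry."""
    codes = []
    for frame in frames:
        distances = [squared_distance(frame, entry) for entry in codebook]
        codes.append(distances.index(min(distances)))
    return codes


def dequantize(codes: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Look up the codebook vector for each discrete code."""
    return [list(codebook[i]) for i in codes]


# Toy data (assumed for illustration only): three codebook vectors and
# four encoder output frames, each of dimension 2.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.3], [0.8, 0.7]]

codes = quantize(frames, codebook)
reconstruction = dequantize(codes, codebook)
print(codes)  # [0, 1, 2, 1]
```

In a setup like the one the abstract outlines, a sequence such as `codes` is what the language model would be trained to predict from the phoneme sequence, and a decoder would turn the looked-up vectors back into a Mel-spectrogram.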

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing

Methods: Test · VQ-VAE