KaraSinger: Score-Free Singing Voice Synthesis with VQ-VAE using Mel-spectrograms
Chien-Feng Liao, Jen-Yu Liu, Yi-Hsuan Yang

TL;DR
KaraSinger is a score-free singing voice synthesis model that combines a VQ-VAE with a language model to generate singing from lyrics without a predefined musical score; listening tests indicate high intelligibility and musicality.
Contribution
It introduces a lightweight VQ-VAE and language model architecture for score-free SVS, enabling spontaneous melody and prosody generation from lyrics.
Findings
Achieved high intelligibility and musicality in listening tests.
Validated effectiveness on a proprietary dataset of 550 English pop songs.
Fast training and inference due to lightweight architecture.
Abstract
In this paper, we propose a novel neural network model called KaraSinger for a less-studied singing voice synthesis (SVS) task named score-free SVS, in which the prosody and melody are spontaneously decided by machine. KaraSinger comprises a vector-quantized variational autoencoder (VQ-VAE) that compresses the Mel-spectrograms of singing audio to sequences of discrete codes, and a language model (LM) that learns to predict the discrete codes given the corresponding lyrics. For the VQ-VAE part, we employ a Connectionist Temporal Classification (CTC) loss to encourage the discrete codes to carry phoneme-related information. For the LM part, we use location-sensitive attention for learning a robust alignment between the input phoneme sequence and the output discrete code. We keep the architecture of both the VQ-VAE and LM light-weight for fast training and inference speed. We validate the…
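The quantization step the abstract describes, in which continuous encoder frames become a sequence of discrete codes, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the codebook size, the feature dimension, the function name `quantize`, and the toy inputs are all assumptions chosen for illustration, and the CTC loss and the language model are left out.

```python
import numpy as np

def quantize(frames, codebook):
    """Map each frame (row) to the index of its nearest codebook vector."""
    # Squared Euclidean distance between every frame (T, D) and every code (K, D).
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)       # (T,) sequence of discrete codes
    return codes, codebook[codes]      # code indices and their quantized vectors

# Hypothetical toy codebook: K=8 codes of dimension D=4 (not the paper's sizes).
codebook = np.vstack([np.eye(4), -np.eye(4)])

# Stand-in for encoder outputs: four frames, each near a known codebook entry.
frames = codebook[[3, 3, 5, 0]] + 0.05

codes, quantized = quantize(frames, codebook)
print(codes.tolist())
```

In a setup like the one the abstract outlines, a code sequence of this kind is what the language model would learn to predict from the lyrics; in a full VQ-VAE the codebook is learned jointly with the encoder and decoder rather than fixed as it is here.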
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Test · VQ-VAE
