Karaoker: Alignment-free singing voice synthesis with speech training data
Panos Kakoulidis, Nikolaos Ellinas, Georgios Vamvoukakis, Konstantinos Markopoulos, June Sig Sung, Gunu Jho, Pirros Tsiakoulis, Aimilios Chalamandaris

TL;DR
Karaoker is a novel singing voice synthesis model that uses speech training data and does not require time-alignment or music scores, enabling style transfer and high-quality singing synthesis.
Contribution
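It introduces a multispeaker Tacotron-based model trained solely on spoken data, conditioned on a multi-dimensional template of voice features (pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves) and trained with auxiliary feature reconstruction, classification and speaker identification tasks.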
Findings
Effective style transfer from unseen speakers
High-quality singing voice synthesis without music score data
Improved model accuracy through multitasking and GAN training
Abstract
Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In…
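The extended training objective mentioned in the abstract (a text-to-speech loss plus feature reconstruction, classification and speaker identification terms) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the function names, the choice of L1 and cross-entropy losses, the weights `w_feat`, `w_cls` and `w_spk`, and the array shapes are hypothetical, and the abstract does not state what the classification task predicts.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error between two arrays of the same shape."""
    return float(np.mean(np.abs(pred - target)))

def cross_entropy(logits, label):
    """Cross-entropy of one example, using a numerically stable log-softmax."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def multitask_loss(mel_pred, mel_target, feat_pred, feat_target,
                   class_logits, class_label, speaker_logits, speaker_id,
                   w_feat=1.0, w_cls=1.0, w_spk=1.0):
    """Weighted sum of a main TTS loss and three auxiliary losses.

    All weights are hypothetical; the abstract gives no values.
    """
    parts = {
        # Main text-to-speech term: spectrogram reconstruction (assumed L1).
        "mel": l1_loss(mel_pred, mel_target),
        # Auxiliary: reconstruct the conditioning features (pitch, intensity, ...).
        "features": l1_loss(feat_pred, feat_target),
        # Auxiliary: a classification task (target unspecified in the abstract).
        "classification": cross_entropy(class_logits, class_label),
        # Auxiliary: identify the speaker of the utterance.
        "speaker": cross_entropy(speaker_logits, speaker_id),
    }
    total = (parts["mel"]
             + w_feat * parts["features"]
             + w_cls * parts["classification"]
             + w_spk * parts["speaker"])
    return total, parts
```

Returning the individual terms alongside the total makes it easy to monitor each auxiliary task separately during training, which is one common way to check that no single term dominates the sum.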
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
