Karaoker: Alignment-free singing voice synthesis with speech training data

Panos Kakoulidis; Nikolaos Ellinas; Georgios Vamvoukakis; Konstantinos Markopoulos; June Sig Sung; Gunu Jho; Pirros Tsiakoulis; Aimilios Chalamandaris

arXiv:2204.04127 · eess.AS · September 30, 2022 · 1 citation

Open Access

TL;DR

Karaoker is a singing voice synthesis model trained only on speech data. It requires neither time-alignments nor music scores, and it synthesizes singing by transferring style from a source waveform.

Contribution

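It introduces a multispeaker Tacotron-based model trained solely on spoken data. The model is conditioned on continuous voice features (pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves), and its training objective is extended with feature reconstruction, classification and speaker identification tasks.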

Findings

1. Effective style transfer from unseen speakers
2. High-quality singing voice synthesis without music score data
3. Improved model accuracy through multitasking and GAN training

Abstract

Existing singing voice synthesis (SVS) models are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice and transfers style following a multi-dimensional template extracted from a source waveform of an unseen singer/speaker. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. In…
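
The extended training objective described in the abstract can be pictured as a base text-to-speech loss plus auxiliary terms. The sketch below is only an illustration under stated assumptions: the abstract does not say how the terms are combined, so the weighted sum, the `LossWeights` class, the `multitask_loss` function and the default weights of 1.0 are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the abstract does not state how the terms are balanced.
    reconstruction: float = 1.0
    classification: float = 1.0
    speaker_id: float = 1.0


def multitask_loss(
    tts_loss: float,
    reconstruction_loss: float,
    classification_loss: float,
    speaker_id_loss: float,
    weights: LossWeights = LossWeights(),
) -> float:
    """Combine a base TTS loss with three auxiliary losses as a weighted sum.

    The weighted sum is an assumption made for illustration; it is one common
    way to build a multi-task objective, not necessarily the authors' choice.
    """
    return (
        tts_loss
        + weights.reconstruction * reconstruction_loss
        + weights.classification * classification_loss
        + weights.speaker_id * speaker_id_loss
    )


# Example with made-up per-batch loss values.
total = multitask_loss(1.0, 0.5, 0.25, 0.25)
```

Setting a weight to zero in this sketch disables the corresponding auxiliary task, which is one simple way to compare training with and without each term.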

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing