Semi-supervised Learning for Singing Synthesis Timbre
Jordi Bonada, Merlijn Blaauw

TL;DR
This paper introduces a semi-supervised singing synthesizer that learns new voices from audio data alone, with no phonetic annotations needed for the new voice, and reports quality comparable to supervised methods. Labelled multi-singer data is still required to pretrain the model.
Contribution
A novel semi-supervised model for singing synthesis that learns new voices without annotations, using a dual-encoder architecture trained on multi-singer data.
Findings
The system performs comparably to supervised approaches in listening tests.
A decoder for a new voice can be trained from unlabelled audio alone, using the pretrained acoustic encoder.
The model generalizes well to new voices without additional annotations.
Abstract
We propose a semi-supervised singing synthesizer, which is able to learn new voices from audio data only, without any annotations such as phonetic segmentation. Our system is an encoder-decoder model with two encoders, linguistic and acoustic, and one (acoustic) decoder. In a first step, the system is trained in a supervised manner, using a labelled multi-singer dataset. Here, we ensure that the embeddings produced by both encoders are similar, so that we can later use the model with either acoustic or linguistic input features. To learn a new voice in an unsupervised manner, the pretrained acoustic encoder is used to train a decoder for the target singer. Finally, at inference, the pretrained linguistic encoder is used together with the decoder of the new voice, to produce acoustic features from linguistic input. We evaluate our system with a listening test and show that the results…
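The three stages described in the abstract can be sketched with a toy model. This is a minimal illustration, not the authors' implementation: the encoders and decoder are plain linear maps, the dimensions and all function names are hypothetical, the stage-1 objective (reconstruction through both encoder paths plus a mean-squared embedding-matching term) is an assumed form since the abstract only says the embeddings are kept similar, and the new-voice decoder is fitted by least squares rather than by gradient training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions, chosen for illustration only.
LING_DIM, AC_DIM, EMB_DIM = 6, 8, 4


def encode(features, weights):
    """Linear stand-in for an encoder: (frames, in_dim) -> (frames, EMB_DIM)."""
    return features @ weights


def decode(embeddings, weights):
    """Linear stand-in for the acoustic decoder: (frames, EMB_DIM) -> (frames, AC_DIM)."""
    return embeddings @ weights


def stage1_loss(ling, ac, w_ling, w_ac, w_dec, match_weight=1.0):
    """Stage 1 (supervised, multi-singer), assumed objective.

    Reconstructs the acoustic features through both encoder paths and
    penalizes the distance between the two embeddings, so that either
    encoder can later drive the decoder.
    """
    z_ling = encode(ling, w_ling)
    z_ac = encode(ac, w_ac)
    recon = (np.mean((decode(z_ling, w_dec) - ac) ** 2)
             + np.mean((decode(z_ac, w_dec) - ac) ** 2))
    match = np.mean((z_ling - z_ac) ** 2)
    return recon + match_weight * match


def fit_decoder_for_new_voice(ac_new, w_ac_frozen):
    """Stage 2 (unsupervised): the acoustic encoder stays frozen and a new
    decoder is fitted on the target singer's audio features only."""
    z = encode(ac_new, w_ac_frozen)
    w_dec_new, *_ = np.linalg.lstsq(z, ac_new, rcond=None)
    return w_dec_new


def synthesize(ling, w_ling, w_dec_new):
    """Stage 3 (inference): pretrained linguistic encoder + new-voice decoder."""
    return decode(encode(ling, w_ling), w_dec_new)
```

In this sketch, no linguistic features appear in stage 2, which mirrors the claim that a new voice needs no annotations. The output of `synthesize` is only meaningful if stage 1 has made the two encoders' embeddings similar, which is why the matching term matters.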
