Everyone-Can-Sing: Zero-Shot Singing Voice Synthesis and Conversion with Speech Reference

Shuqi Dai; Yunyun Wang; Roger B. Dannenberg; Zeyu Jin

arXiv:2501.13870·cs.SD·January 24, 2025

TL;DR

This paper introduces a unified zero-shot framework for singing voice synthesis and conversion that leverages pre-trained embeddings and diffusion models, enabling high-quality, controllable singing voice generation and cloning from speech references.

Contribution

A novel unified zero-shot framework for SVS and SVC using pre-trained embeddings and diffusion models, addressing cross-domain limitations and data scarcity.

Findings

1. Significant improvements in timbre similarity and musicality.
2. Effective singing voice cloning from speech references.
3. Insights into low-data music tasks like instrumental style transfer.

Abstract

We propose a unified framework for Singing Voice Synthesis (SVS) and Conversion (SVC), addressing the limitations of existing approaches in cross-domain SVS/SVC, poor output musicality, and scarcity of singing data. Our framework enables control over multiple aspects, including language content based on lyrics, performance attributes based on a musical score, singing style and vocal techniques based on a selector, and voice identity based on a speech sample. The proposed zero-shot learning paradigm consists of one SVS model and two SVC models, utilizing pre-trained content embeddings and a diffusion-based generator. The proposed framework is also trained on mixed datasets comprising both singing and speech audio, allowing singing voice cloning based on speech reference. Experiments show substantial improvements in timbre similarity and musicality over state-of-the-art baselines,…

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing