SaMoye: Zero-shot Singing Voice Conversion Model Based on Feature Disentanglement and Enhancement

Zihao Wang; Le Ma; Yongsheng Feng; Xin Pan; Yuhang Jin; Kejun Zhang

arXiv:2407.07728·cs.SD·November 18, 2024

TL;DR

SaMoye is a zero-shot singing voice conversion model that disentangles content, timbre, and pitch features and enhances the timbre features, enabling high-quality conversion to human and non-human timbres. It is trained on a large-scale dataset with more than 1,815 hours of pure singing voice.

Contribution

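Introduces SaMoye, described by the authors as the first open-source, high-quality zero-shot SVC model that can convert singing to both human and non-human timbres. It reduces timbre leakage by combining multiple ASR models and compressing the content features, and it enhances timbre by unfreezing the speaker encoder and mixing the speaker embedding with those of the top-3 most similar speakers.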

Findings

01. Outperforms existing models in zero-shot SVC tasks

02. Effective conversion to non-human timbres like animals

03. Utilizes large-scale dataset for robust zero-shot performance

Abstract

Singing voice conversion (SVC) aims to convert a singer's voice to that of another singer, given a reference audio, while keeping the original semantics. However, existing SVC methods can hardly perform zero-shot conversion because of incomplete feature disentanglement or dependence on a speaker look-up table. We propose SaMoye, the first open-source high-quality zero-shot SVC model, which can convert singing to human and non-human timbres. SaMoye disentangles the singing voice's features into content, timbre, and pitch features; we combine multiple ASR models and compress the content features to reduce timbre leakage. In addition, we enhance the timbre features by unfreezing the speaker encoder and mixing the speaker embedding with those of the top-3 most similar speakers. We also establish an unparalleled large-scale dataset to guarantee zero-shot performance, which comprises more than 1,815 hours of pure singing voice and…
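
The timbre enhancement step mentioned in the abstract, mixing a speaker embedding with its top-3 most similar speakers, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-similarity ranking, the plain mean over neighbours, and the `alpha` blending weight are all assumptions chosen for illustration, and the toy 2-dimensional vectors stand in for real speaker-encoder outputs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mix_with_top_k(target, bank, k=3, alpha=0.5):
    """Blend `target` with the mean of its k most similar embeddings in `bank`.

    Assumptions (not from the paper): similarity is cosine, neighbours are
    averaged with equal weight, and `alpha` is the weight kept on the target.
    Returns the mixed embedding and the names of the neighbours used.
    """
    ranked = sorted(bank, key=lambda name: cosine(target, bank[name]), reverse=True)
    top = ranked[:k]
    if not top:
        # Nothing to mix with: return the target embedding unchanged.
        return list(target), []
    mean = [sum(bank[name][i] for name in top) / len(top) for i in range(len(target))]
    mixed = [alpha * t + (1 - alpha) * m for t, m in zip(target, mean)]
    return mixed, top


# Hypothetical speaker bank with toy 2-dimensional embeddings.
bank = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.8, 0.2],
    "d": [0.0, 1.0],
}
mixed, neighbours = mix_with_top_k([1.0, 0.05], bank, k=3, alpha=0.5)
```

Here the dissimilar speaker `d` is excluded from the mix, so the result stays close to the target while being pulled toward its nearest neighbours. How the actual model weights the neighbours is not stated in the abstract.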

Code & Models

Repositories

carlwangchina/samoye-svc
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing