FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion
Alef Iury Siqueira Ferreira, Lucas Rafael Gris, Augusto Seben da Rosa, Frederico Santos de Oliveira, Edresson Casanova, Rafael Teixeira Sousa, Arnaldo Candido Junior, Anderson da Silva Soares, Arlindo Galvão Filho

TL;DR
FreeSVC is a zero-shot multilingual singing voice conversion system that combines an enhanced VITS model, Speaker-invariant Clustering (SPIN) content features, the ECAPA2 speaker encoder, and trainable language embeddings to enable cross-lingual conversion without extensive language-specific training.
Contribution
It presents a novel zero-shot multilingual singing voice conversion method using enhanced VITS, speaker-invariant clustering, and language embeddings for improved cross-lingual performance.
Findings
Effective zero-shot cross-lingual conversion demonstrated
Multilingual content extractor improves conversion quality
Publicly available source code and models
Abstract
This work presents FreeSVC, a promising multilingual singing voice conversion approach that leverages an enhanced VITS model with Speaker-invariant Clustering (SPIN) for better content representation and the State-of-the-Art (SOTA) speaker encoder ECAPA2. FreeSVC incorporates trainable language embeddings to handle multiple languages and employs an advanced speaker encoder to disentangle speaker characteristics from linguistic content. Designed for zero-shot learning, FreeSVC enables cross-lingual singing voice conversion without extensive language-specific training. We demonstrate that a multilingual content extractor is crucial for optimal cross-language conversion. Our source code and models are publicly available.
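The abstract names three conditioning signals (content features, a trainable language embedding, and a speaker embedding) but does not say how they are combined. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the language embedding is added to every content frame and the speaker embedding is concatenated as a global condition. All names, dimensions, and values (`make_language_table`, `condition_frames`, `CONTENT_DIM`) are hypothetical, and plain Python lists stand in for tensors.

```python
import random

CONTENT_DIM = 4  # hypothetical size of one content-feature frame


def make_language_table(languages, dim, seed=0):
    """Build one small random vector per language (a stand-in for trainable embeddings)."""
    rng = random.Random(seed)
    return {lang: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for lang in languages}


def condition_frames(content_frames, lang_vec, speaker_vec):
    """Add the language vector to each frame, then append the speaker vector.

    Assumption: additive language conditioning and concatenated speaker
    conditioning. The paper may combine these signals differently.
    """
    conditioned = []
    for frame in content_frames:
        if len(frame) != len(lang_vec):
            raise ValueError("content frame and language vector must have equal length")
        with_lang = [c + l for c, l in zip(frame, lang_vec)]
        conditioned.append(with_lang + list(speaker_vec))
    return conditioned


# Toy inputs: three content frames, a target language, and a target-speaker embedding.
language_table = make_language_table(["en", "pt"], CONTENT_DIM)
content = [[0.0] * CONTENT_DIM, [1.0] * CONTENT_DIM, [0.5] * CONTENT_DIM]
target_speaker = [0.3, -0.7]  # would come from a speaker encoder in a real system

out = condition_frames(content, language_table["pt"], target_speaker)
```

In this sketch the zero-shot property corresponds to `target_speaker` being computed from a reference recording of an unseen singer, so switching the target voice changes only that vector and requires no retraining.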
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
