GenVC: Self-Supervised Zero-Shot Voice Conversion

Zexin Cai; Henry Li Xinyuan; Ashi Garg; Leibny Paola García-Perera; Kevin Duh; Sanjeev Khudanpur; Matthew Wiesner; Nicholas Andrews

arXiv:2502.04519 · eess.AS · August 21, 2025

TL;DR

GenVC is a self-supervised zero-shot voice conversion framework that disentangles speaker identity from linguistic content without externally supervised components. Built on speech tokenizers and an autoregressive Transformer, it achieves high speaker similarity and improves source-speaker privacy protection.

Contribution

The paper introduces a self-supervised approach to zero-shot voice conversion that removes the need for an external speaker encoder, using speech tokenizers and an autoregressive Transformer.

Findings

01 Higher speaker similarity than existing methods

02 Maintains naturalness comparable to top zero-shot approaches

03 Enhances privacy protection and voice anonymization

Abstract

Most current zero-shot voice conversion methods rely on externally supervised components, particularly speaker encoders, for training. To explore alternatives that eliminate this dependency, this paper introduces GenVC, a novel framework that disentangles speaker identity and linguistic content from speech signals in a self-supervised manner. GenVC leverages speech tokenizers and an autoregressive, Transformer-based language model as its backbone for speech generation. This design supports large-scale training while enhancing both source speaker privacy protection and target speaker cloning fidelity. Experimental results demonstrate that GenVC achieves notably higher speaker similarity, with naturalness on par with leading zero-shot approaches. Moreover, due to its autoregressive formulation, GenVC introduces flexibility in temporal alignment, reducing the preservation of source prosody…
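The abstract's point about temporal alignment can be illustrated with a toy sketch. Because an autoregressive decoder emits one token at a time until it chooses to stop, the output length is not tied to the source length. Everything below is hypothetical and is not GenVC's code: `greedy_decode`, `toy_step`, the `EOS` id, and the token values are made up for illustration, and `toy_step` is a hand-written rule standing in for a trained Transformer.

```python
from typing import Callable, List, Sequence

EOS = 0  # hypothetical end-of-sequence token id

StepFn = Callable[[Sequence[int], Sequence[int], Sequence[int]], List[float]]


def greedy_decode(step_fn: StepFn,
                  content_tokens: Sequence[int],
                  speaker_prompt: Sequence[int],
                  max_len: int = 50) -> List[int]:
    """Generate output tokens one at a time until EOS or max_len."""
    generated: List[int] = []
    for _ in range(max_len):
        scores = step_fn(content_tokens, speaker_prompt, generated)
        # Greedy choice: index of the highest-scoring token.
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == EOS:
            break
        generated.append(token)
    return generated


def toy_step(content_tokens: Sequence[int],
             speaker_prompt: Sequence[int],
             generated: Sequence[int]) -> List[float]:
    """Stand-in for a trained model: a fixed rule, not a real network."""
    vocab_size = 8
    scores = [0.0] * vocab_size
    # Assumed rule: emit two output tokens per content token, then stop.
    target_len = 2 * len(content_tokens)
    if len(generated) >= target_len:
        scores[EOS] = 1.0
    else:
        src = content_tokens[len(generated) // 2]
        offset = speaker_prompt[0] % 3  # the "speaker" shifts the output ids
        scores[1 + (src + offset) % (vocab_size - 1)] = 1.0
    return scores


content = [3, 5, 2]        # made-up content token ids
speaker = [4, 1]           # made-up speaker prompt token ids
out = greedy_decode(toy_step, content, speaker)
print(len(content), len(out), out)
```

In this toy setup the decoder produces six tokens from three content tokens, and changing `speaker` changes which ids are emitted. A frame-synchronous converter would instead produce exactly one output frame per input frame.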

Code & Models

Models

ZexinCai/GenVC (Hugging Face model)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing