EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

Advait Joglekar; Divyanshu Singh; Rooshil Rohit Bhatia; S. Umesh

arXiv:2505.16691·cs.SD·May 26, 2025

TL;DR

EZ-VC is a simple, self-supervised, zero-shot voice conversion method built on a Diffusion-Transformer decoder. It handles cross-lingual conversion and unseen speakers without needing multiple encoders.

Contribution

The paper presents a textless, self-supervised voice conversion approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer decoder trained with conditional flow matching, improving zero-shot cross-lingual performance.

Findings

01 Effective zero-shot cross-lingual voice conversion.

02 No need for multiple encoders to disentangle speech features.

03 Outperforms existing methods in unseen speaker scenarios.

Abstract

Recent voice conversion research has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize to speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also excels in zero-shot cross-lingual settings, even for unseen languages. For Demo:…
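
The two ingredients named in the abstract, discrete speech units and a conditional flow matching decoder, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the nearest-centroid quantizer stands in for whatever unit extraction the paper uses, the straight-line noise-to-data path is one common flow matching choice assumed here, and the Diffusion-Transformer that would predict the velocity is left out entirely.

```python
import numpy as np

def quantize_to_units(features, codebook):
    """Assign each frame in features (T, D) to its nearest codebook row (K, D).

    Hypothetical stand-in for turning self-supervised features into discrete units.
    """
    sq_dist = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return sq_dist.argmin(axis=1)

def cfm_training_pair(x1, t, rng):
    """Build one flow matching training example for target frames x1.

    Assumes a straight-line path from Gaussian noise x0 to data x1:
    x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
    A decoder network would be regressed onto target_velocity given x_t and t.
    """
    x0 = rng.standard_normal(x1.shape)
    xt = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return xt, target_velocity

def euler_sample(velocity_fn, x0, steps=16):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.standard_normal((5, 4))      # toy "SSL features"
    codebook = rng.standard_normal((8, 4))    # toy unit codebook
    print("units:", quantize_to_units(frames, codebook))

    mel = rng.standard_normal((5, 80))        # toy target spectrogram frames
    xt, v = cfm_training_pair(mel, t=0.3, rng=rng)
    print("x_t shape:", xt.shape, "velocity shape:", v.shape)
```

Because generation is a fixed number of Euler steps over the whole sequence rather than a frame-by-frame loop, a sampler of this shape is non-autoregressive, which matches the property the abstract attributes to the decoder.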

Taxonomy

Methods: ADaptive gradient method with the OPTimal convergence rate