TL;DR
EZ-VC is a simple, self-supervised, zero-shot voice conversion method that handles cross-lingual and unseen-speaker scenarios without multiple encoders, using a Diffusion-Transformer based decoder.
Contribution
The paper presents a novel, textless, self-supervised voice conversion approach combining discrete speech representations with a diffusion-transformer decoder, improving zero-shot cross-lingual performance.
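To make "discrete speech representations" concrete, here is a minimal sketch in plain Python that turns continuous frame-level features into unit IDs by nearest-centroid lookup. This summary does not say how EZ-VC discretizes its features, so the k-means-style codebook, the function name `quantize_frames`, and the toy 2-D vectors are all illustrative assumptions.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_frames(
    frames: List[Sequence[float]], codebook: List[Sequence[float]]
) -> List[int]:
    """Map each feature frame to the index of its nearest codebook centroid."""
    units = []
    for frame in frames:
        distances = [squared_distance(frame, c) for c in codebook]
        units.append(distances.index(min(distances)))
    return units


# Hypothetical stand-ins: real self-supervised features have far more dimensions,
# and a real codebook would be learned from data rather than written by hand.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.2, 0.1)]
units = quantize_frames(frames, codebook)
```

The resulting integer sequence is what a textless decoder would be conditioned on in place of a transcript.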
Findings
Effective zero-shot cross-lingual voice conversion.
No need for multiple encoders to disentangle speech features.
Outperforms existing methods in unseen speaker scenarios.
Abstract
Voice Conversion research in recent times has increasingly focused on improving the zero-shot capabilities of existing methods. Despite remarkable advancements, current architectures still tend to struggle in zero-shot cross-lingual settings. They are also often unable to generalize for speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive Diffusion-Transformer based conditional flow matching speech decoder. We show that this architecture allows us to train a voice-conversion model in a purely textless, self-supervised fashion. Our technique works without requiring multiple encoders to disentangle speech features. Our model also manages to excel in zero-shot cross-lingual settings even for unseen languages. For Demo:…
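The conditional flow matching objective named in the abstract can be sketched in a few lines of plain Python. The abstract gives no equations, so this assumes the common straight-line path from noise to data with zero minimum noise. The names `cfm_training_pair` and `euler_sample` are hypothetical, and an oracle velocity field stands in for the trained Diffusion-Transformer, whose conditioning on discrete units is omitted.

```python
import random
from typing import Callable, List, Tuple


def cfm_training_pair(
    x0: List[float], x1: List[float], t: float
) -> Tuple[List[float], List[float]]:
    """Return (x_t, target_velocity) for a straight-line probability path.

    x0 is a noise sample, x1 is a data sample (e.g. one spectrogram frame),
    and t in [0, 1] is the interpolation time. A network would be trained to
    predict target_velocity from (x_t, t) plus its conditioning.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def euler_sample(
    x0: List[float],
    velocity_fn: Callable[[List[float], float], List[float]],
    steps: int = 8,
) -> List[float]:
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # toy "data" frame (assumption)
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample

x_t, target = cfm_training_pair(x0, x1, t=0.25)

# Oracle velocity field: the exact target, used here instead of a trained model.
generated = euler_sample(x0, lambda x, t: target, steps=8)
```

With the exact velocity, integrating from noise lands on the data sample. In this non-autoregressive setup all frames are generated in parallel over a fixed number of integration steps.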
Taxonomy
Methods: ADaptive gradient method with the OPTimal convergence rate
