Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement
Xueyao Zhang, Xiaohui Zhang, Kainan Peng, Zhenyu Tang, Vimal Manohar, Yingru Liu, Jeff Hwang, Dangna Li, Yuhao Wang, Julian Chan, Yuan Huang, Zhizheng Wu, Mingbo Ma

TL;DR
Vevo is a self-supervised framework for zero-shot voice imitation that disentangles speech attributes to enable controllable and versatile voice conversion without requiring annotated data.
Contribution
Vevo introduces a fully self-supervised approach for disentangling speech attributes and a two-stage model for controllable zero-shot voice imitation that matches or surpasses existing methods in accent and emotion conversion.
Findings
Matches or surpasses existing methods in accent and emotion conversion
Effective zero-shot voice conversion and TTS demonstrated
Operates without fine-tuning on style-specific data
Abstract
The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre and style, leading to challenges in achieving controllable generation, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or speech's content tokens as input, we utilize an autoregressive transformer to generate the content-style tokens, which is prompted by a style reference; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer to produce acoustic representations, which is prompted by a timbre reference. To obtain…
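The two stages described in the abstract can be pictured as a simple composition: a style reference prompts the first stage, and a timbre reference prompts the second. The sketch below is only an illustration of that data flow, not the authors' implementation. The class `TwoStageImitator`, its method `imitate`, and the toy stand-in functions are all hypothetical names; real stages would be an autoregressive transformer and a flow-matching transformer rather than these arithmetic placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Tokens = List[int]
Features = List[List[float]]


@dataclass
class TwoStageImitator:
    """Hypothetical wrapper that chains the two stages named in the abstract."""

    # Stage 1 stand-in: (content tokens, style reference) -> content-style tokens
    content_style_model: Callable[[Sequence[int], Sequence[float]], Tokens]
    # Stage 2 stand-in: (content-style tokens, timbre reference) -> acoustic features
    acoustic_model: Callable[[Sequence[int], Sequence[float]], Features]

    def imitate(
        self,
        content_tokens: Sequence[int],
        style_ref: Sequence[float],
        timbre_ref: Sequence[float],
    ) -> Features:
        # The style reference only influences stage 1.
        cs_tokens = self.content_style_model(content_tokens, style_ref)
        # The timbre reference only influences stage 2.
        return self.acoustic_model(cs_tokens, timbre_ref)


def toy_content_style(content: Sequence[int], style_ref: Sequence[float]) -> Tokens:
    # Placeholder for the autoregressive transformer: shift tokens by prompt length.
    return [t + len(style_ref) for t in content]


def toy_acoustic(tokens: Sequence[int], timbre_ref: Sequence[float]) -> Features:
    # Placeholder for the flow-matching transformer: pair each token with the
    # mean of the timbre reference.
    mean = sum(timbre_ref) / len(timbre_ref)
    return [[float(t), mean] for t in tokens]


pipeline = TwoStageImitator(toy_content_style, toy_acoustic)
features = pipeline.imitate([1, 2, 3], style_ref=[0.1, 0.2], timbre_ref=[0.5, 1.5])
print(features)
```

Because the two references enter at different stages, swapping only `style_ref` or only `timbre_ref` changes one attribute while leaving the other untouched, which is the kind of control the abstract describes.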
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: VQ-VAE · ADaptive gradient method with the OPTimal convergence rate
