Vevo: Controllable Zero-Shot Voice Imitation with Self-Supervised Disentanglement

Xueyao Zhang; Xiaohui Zhang; Kainan Peng; Zhenyu Tang; Vimal Manohar; Yingru Liu; Jeff Hwang; Dangna Li; Yuhao Wang; Julian Chan; Yuan Huang; Zhizheng Wu; Mingbo Ma

arXiv:2502.07243·cs.SD·March 30, 2025

Open Access

TL;DR

Vevo is a self-supervised framework for zero-shot voice imitation that disentangles speech attributes to enable controllable and versatile voice conversion without requiring annotated data.

Contribution

Vevo introduces a fully self-supervised approach to disentangling speech attributes and a two-stage model for controllable zero-shot voice imitation, which is reported to match or surpass existing methods in accent and emotion conversion.

Findings

01. Matches or surpasses existing methods in accent and emotion conversion

02. Demonstrates effective zero-shot voice conversion and text-to-speech (TTS)

03. Operates without fine-tuning on style-specific data

Abstract

The imitation of voice, targeted on specific speech attributes such as timbre and speaking style, is crucial in speech generation. However, existing methods rely heavily on annotated data, and struggle with effectively disentangling timbre and style, leading to challenges in achieving controllable generation, especially in zero-shot scenarios. To address these issues, we propose Vevo, a versatile zero-shot voice imitation framework with controllable timbre and style. Vevo operates in two core stages: (1) Content-Style Modeling: Given either text or speech's content tokens as input, we utilize an autoregressive transformer to generate the content-style tokens, which is prompted by a style reference; (2) Acoustic Modeling: Given the content-style tokens as input, we employ a flow-matching transformer to produce acoustic representations, which is prompted by a timbre reference. To obtain…
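The two stages named in the abstract can be read as a simple data flow: content goes in, a style reference conditions the first stage, and a timbre reference conditions the second. The sketch below only illustrates that wiring. Every name in it (`Reference`, `content_style_modeling`, `acoustic_modeling`, `imitate`) is hypothetical, and the function bodies are string-tagging placeholders standing in for the autoregressive and flow-matching transformers, not the authors' models or code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Union


@dataclass
class Reference:
    """Hypothetical stand-in for a reference utterance (a label only, no audio)."""
    name: str


def content_style_modeling(content: Union[str, Sequence[int]],
                           style_ref: Reference) -> List[str]:
    """Stage 1 placeholder: where the autoregressive transformer would sit.

    Accepts either text or a sequence of content-token ids and returns
    content-style tokens tagged with the style reference that prompted them.
    """
    if isinstance(content, str):
        units = content.split()
    else:
        units = [f"c{token_id}" for token_id in content]
    return [f"{unit}|style={style_ref.name}" for unit in units]


def acoustic_modeling(content_style_tokens: Sequence[str],
                      timbre_ref: Reference) -> List[str]:
    """Stage 2 placeholder: where the flow-matching transformer would sit.

    Tags each content-style token with the timbre reference that prompted it.
    """
    return [f"{token}|timbre={timbre_ref.name}" for token in content_style_tokens]


def imitate(content: Union[str, Sequence[int]],
            style_ref: Reference,
            timbre_ref: Reference) -> List[str]:
    """Chain the two stages: style is fixed in stage 1, timbre in stage 2."""
    tokens = content_style_modeling(content, style_ref)
    return acoustic_modeling(tokens, timbre_ref)


# Style and timbre come from different references, so they are set independently.
frames = imitate("hello world", Reference("speaker_A"), Reference("speaker_B"))
print(frames)
```

Passing a list of integers instead of a string, for example `imitate([3, 7], ...)`, exercises the speech-content-token input path with the same two-stage wiring.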

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing

Methods: VQ-VAE · ADaptive gradient method with the OPTimal convergence rate