Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation

Xueyao Zhang; Junan Zhang; Yuancheng Wang; Chaoren Wang; Yuanzhe Chen; Dongya Jia; Zhuo Chen; Zhizheng Wu

arXiv:2508.16332 · cs.SD · March 6, 2026

TL;DR

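Vevo2 is a unified framework for controllable speech and singing voice generation. It pairs two unified audio tokenizers with auto-regressive content-style modeling and flow-matching acoustic modeling, which helps with the scarcity of annotated singing data and allows flexible control over text, prosody, style, and timbre.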

Contribution

It introduces two unified audio tokenizers and a joint training approach that significantly improve controllability and quality in speech and singing voice synthesis.

Findings

01. Effective in both speech and singing voice generation

02. Enhances controllability over text, prosody, and style

03. Demonstrates strong generalization across tasks

Abstract

Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a unified music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly,…
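
The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not Vevo2's implementation: every name here (`Controls`, `ar_content_style_stage`, `acoustic_stage`, `VOCAB_SIZE`) is hypothetical, the arithmetic stands in for trained models, and the only part taken from the abstract is the shape of the pipeline, in which text, prosody, and style condition an auto-regressive token stage and a timbre reference conditions a flow-matching-style acoustic stage.

```python
from dataclasses import dataclass
from typing import List

VOCAB_SIZE = 1024  # hypothetical codebook size, not taken from the paper


@dataclass
class Controls:
    text: str                  # linguistic content
    prosody_tokens: List[int]  # stand-in for prosody tokenizer output
    style_tokens: List[int]    # stand-in for tokens of a style reference
    timbre_ref: List[float]    # stand-in for timbre reference features


def ar_content_style_stage(c: Controls) -> List[int]:
    """Toy stand-in for the AR stage: emits one token at a time,
    each depending on the controls and on the previous token."""
    out: List[int] = []
    prev = sum(c.style_tokens) % VOCAB_SIZE  # style prompt seeds the sequence
    for i, ch in enumerate(c.text):
        pros = c.prosody_tokens[i % len(c.prosody_tokens)] if c.prosody_tokens else 0
        prev = (prev + ord(ch) + pros) % VOCAB_SIZE
        out.append(prev)
    return out


def acoustic_stage(tokens: List[int], timbre_ref: List[float], steps: int = 8) -> List[float]:
    """Toy stand-in for the flow-matching stage: Euler integration of the
    velocity field v(x, t) = (target - x) / (1 - t) from t = 0 to t = 1."""
    mean_timbre = sum(timbre_ref) / len(timbre_ref)
    target = [tok / VOCAB_SIZE + mean_timbre for tok in tokens]
    x = [0.0] * len(tokens)  # fixed start point; a real sampler would draw noise
    for k in range(steps):
        # dt * v(x, t) with dt = 1/steps and t = k/steps
        x = [xi + (ti - xi) / (steps - k) for xi, ti in zip(x, target)]
    return x


def generate(c: Controls) -> List[float]:
    """Run both stages: controls -> content-style tokens -> acoustic features."""
    tokens = ar_content_style_stage(c)
    return acoustic_stage(tokens, c.timbre_ref)
```

In this sketch, swapping `timbre_ref` changes the output features while leaving the intermediate tokens untouched, which mirrors the separation the abstract describes between content-style modeling and timbre control.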

Code & Models

Models

RMSnow/Vevo2
model · ♡ 3

Datasets

lestervioleta/svcc2025
dataset · 4 downloads

Taxonomy

Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Phonetics and Phonology Research