DreamVoice: Text-Guided Voice Conversion

Jiarui Hai; Karan Thakkar; Helin Wang; Zengyi Qin; Mounya Elhilali

arXiv:2406.16314 · eess.AS · June 25, 2024 · Interspeech

TL;DR

DreamVoice introduces a dataset of voice timbre annotations (DreamVoiceDB) and two text-guided voice conversion methods (DreamVC and DreamVG) that convert a voice to match a textual description, removing the need for a target recording at inference.

Contribution

The paper presents the DreamVoiceDB dataset and two text-guided VC methods, DreamVC and DreamVG, which bring text-based control to voice conversion.

Findings

01 High-quality voice conversion aligned with text prompts

02 Accurate voice timbre generation, trained on annotations for 900 speakers

03 Versatile plugin (DreamVG) compatible with existing one-shot VC models

Abstract

Generative voice technologies are rapidly evolving, offering opportunities for more personalized and inclusive experiences. Traditional one-shot voice conversion (VC) requires a target recording during inference, which limits ease of use when generating desired voice timbres. Text-guided generation offers an intuitive solution to convert voices to desired "DreamVoices" according to the users' needs. Our paper presents two major contributions to VC technology: (1) DreamVoiceDB, a robust dataset of voice timbre annotations for 900 speakers from VCTK and LibriTTS; and (2) two text-guided VC methods: DreamVC, an end-to-end diffusion-based text-guided VC model, and DreamVG, a versatile text-to-voice generation plugin that can be combined with any one-shot VC model. The experimental results demonstrate that our proposed methods trained on the DreamVoiceDB dataset generate voice timbres accurately…

Code & Models

Models

🤗 myshell-ai/DreamVoice

Taxonomy

Topics: Speech Recognition and Synthesis