PromptSpeaker: Speaker Generation Based on Text Descriptions

Yongmao Zhang; Guanghou Liu; Yi Lei; Yunlin Chen; Hao Yin; Lei Xie; Zhifei Li

arXiv:2310.05001·cs.SD·October 10, 2023

Open Access

TL;DR

PromptSpeaker generates speaker voices from text descriptions by combining a prompt encoder, a Glow model, and a zero-shot VITS synthesizer, so it can create speakers that do not appear in the training set.

Contribution

It introduces a new text-guided speaker generation framework combining prompt encoding, Glow, and zero-shot VITS for flexible speaker synthesis from descriptions.

Findings

01

Successfully generates new speakers from text prompts.

02

Achieves reasonable subjective quality in voice matching.

03

Objective metrics confirm that the generated speakers are new relative to the training set.

Abstract

Recently, text-guided content generation has received extensive attention. In this work, we explore the possibility of text description-based speaker generation, i.e., using text prompts to control the speaker generation process. Specifically, we propose PromptSpeaker, a text-guided speaker generation system. PromptSpeaker consists of a prompt encoder, a zero-shot VITS, and a Glow model, where the prompt encoder predicts a prior distribution based on the text description and samples from this distribution to obtain a semantic representation. The Glow model subsequently converts the semantic representation into a speaker representation, and the zero-shot VITS finally synthesizes the speaker's voice based on the speaker representation. We verify through objective metrics that PromptSpeaker can generate speakers not present in the training set, and the synthetic speaker voice has reasonable…
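
The data flow described in the abstract (text description, then a predicted prior, then a sampled semantic representation, then a flow that maps it to a speaker representation) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the names `prompt_encoder`, `sample_semantic`, `AffineFlow`, and `generate_speaker_embedding`, the embedding size `DIM`, the diagonal Gaussian form of the prior, and the elementwise affine map standing in for Glow are all hypothetical and are not taken from the paper. The zero-shot VITS synthesis step is omitted.

```python
import math
import random

DIM = 4  # hypothetical embedding size, chosen only for illustration


def prompt_encoder(description):
    """Toy stand-in for a prompt encoder: text -> (mean, log-variance) of a Gaussian prior.

    A real encoder would be a trained network; here the values are pseudo-random
    numbers seeded by the characters of the description.
    """
    rng = random.Random(sum(ord(c) for c in description))
    mean = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
    log_var = [rng.uniform(-2.0, 0.0) for _ in range(DIM)]
    return mean, log_var


def sample_semantic(mean, log_var, rng):
    """Draw z = mean + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


class AffineFlow:
    """Toy invertible elementwise affine map, standing in for a Glow model."""

    def __init__(self, scale, shift):
        self.scale = scale  # every entry must be nonzero so the map is invertible
        self.shift = shift

    def forward(self, z):
        # Assumed direction: semantic representation -> speaker representation.
        return [s * x + b for s, x, b in zip(self.scale, z, self.shift)]

    def inverse(self, e):
        # Speaker representation -> semantic representation.
        return [(y - b) / s for s, y, b in zip(self.scale, e, self.shift)]


def generate_speaker_embedding(description, flow, seed=0):
    """Description -> prior -> sampled semantic vector -> speaker embedding."""
    mean, log_var = prompt_encoder(description)
    z = sample_semantic(mean, log_var, random.Random(seed))
    # A zero-shot TTS model would consume this embedding; that step is not shown.
    return flow.forward(z)
```

Because the semantic representation is sampled rather than computed deterministically, calling `generate_speaker_embedding` with the same description and different `seed` values yields different speaker embeddings, which is one plausible reading of how a single description could map to many distinct voices.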

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech and dialogue systems