PromptStyle: Controllable Style Transfer for Text-to-Speech with Natural Language Descriptions

Guanghou Liu; Yongmao Zhang; Yi Lei; Yunlin Chen; Rui Wang; Zhifei Li; Lei Xie

arXiv:2305.19522 · cs.SD · June 2, 2023 · 2 cites

TL;DR

PromptStyle enables controllable style transfer for text-to-speech guided by natural language descriptions, so users can specify a target style by typing text instead of supplying reference speech.

Contribution

This work introduces PromptStyle, a system that combines an improved VITS with a cross-modal style encoder to perform text prompt-guided cross-speaker style transfer in TTS, without requiring reference speech in the target style.

Findings

01 Achieves effective style transfer guided by text prompts
02 Maintains high speaker similarity and stability
03 Demonstrates practical applicability with natural language control

Abstract

Style transfer TTS has shown impressive performance in recent years. However, style control is often restricted to systems built on expressive speech recordings with discrete style categories. In practical situations, users may be interested in transferring style by typing text descriptions of desired styles, without reference speech in the target style. Text-guided content generation techniques have drawn wide attention recently. In this work, we explore the possibility of controllable style transfer with natural language descriptions. To this end, we propose PromptStyle, a text prompt-guided cross-speaker style transfer system. Specifically, PromptStyle consists of an improved VITS and a cross-modal style encoder. The cross-modal style encoder constructs a shared space of stylistic and semantic representation through a two-stage training process. Experiments show that…
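
The abstract does not spell out the two training stages. One plausible reading, sketched below purely as an illustration, is that a first stage yields style embeddings from reference speech, and a second stage fits a prompt-side projection into that same space so a text description alone can select a style. Everything in the sketch is an assumption rather than the authors' method: the toy vectors stand in for real encoder outputs, and the names `train_prompt_projection`, `PROMPT_EMBS`, and `STYLE_EMBS` are hypothetical.

```python
import random

# Hypothetical stand-ins for encoder outputs (not values from the paper).
# One prompt embedding per style description: "happy", "sad", "angry".
PROMPT_EMBS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Assumed output of stage 1: frozen style embeddings learned from speech.
STYLE_EMBS = [[1.0, 0.5], [-1.0, 0.2], [0.3, -0.9]]


def linear(weights, x):
    """Apply a weight matrix (a list of rows) to the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def train_prompt_projection(prompt_embs, style_embs, lr=0.1, epochs=500, seed=0):
    """Assumed stage 2: fit W so that W applied to a prompt embedding
    lands on the matching frozen style embedding (plain SGD on MSE)."""
    rng = random.Random(seed)
    d_out, d_in = len(style_embs[0]), len(prompt_embs[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]
    for _ in range(epochs):
        for prompt, style in zip(prompt_embs, style_embs):
            pred = linear(weights, prompt)
            for i in range(d_out):
                # Gradient of the MSE with respect to row i of the weights.
                grad_scale = 2.0 * (pred[i] - style[i]) / d_out
                for j in range(d_in):
                    weights[i][j] -= lr * grad_scale * prompt[j]
    return weights


W = train_prompt_projection(PROMPT_EMBS, STYLE_EMBS)
# At inference time only the text prompt is needed to obtain a style vector.
style_from_prompt = linear(W, PROMPT_EMBS[0])
```

In a real system the projected vector would condition the acoustic model in place of an embedding extracted from reference speech; the linear map and squared-error loss here are chosen only to keep the sketch short.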

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing