InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt

Dongchao Yang; Songxiang Liu; Rongjie Huang; Chao Weng; Helen Meng

arXiv:2301.13662 · cs.SD · June 27, 2023 · 5 cites

TL;DR

InstructTTS is an expressive TTS system that controls speaking style with natural language prompts. The paper contributes a new speech corpus for this task and a diffusion model operating in a discrete latent space to generate diverse, controllable speaking styles.

Contribution

The paper presents a three-stage training process, a diffusion model operating in a discrete latent space, and a method for controlling speaking style with natural language prompts.

Findings

01 Effective style control with natural language prompts
02 Robust sentence embedding capturing style semantics
03 Generation of diverse expressive speech samples

Abstract

Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according to users' demands. Currently, there are two common ways to control speaking style: (1) Pre-defining a set of speaking styles and using a categorical index to denote each one. However, the diversity of expressiveness is limited, as these models can only generate the pre-defined styles. (2) Using reference speech as the style input, which has the problem that the extracted style information is neither intuitive nor interpretable. In this study, we attempt to use natural language as a style prompt to control the style of the synthesized speech, e.g., "Sigh tone in full of sad mood with some helpless feeling". Considering that no existing TTS corpus is suitable for benchmarking this novel task, we first construct a speech corpus, whose speech samples are annotated…

Code & Models

Repositories

yangdongchao/academicodec · PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing