InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt
Dongchao Yang, Songxiang Liu, Rongjie Huang, Chao Weng, Helen Meng

TL;DR
InstructTTS is an expressive TTS system that controls speaking style through natural language prompts. It combines a newly constructed speech corpus with a diffusion model operating in a discrete latent space to generate diverse, controllable speaking styles.
Contribution
The paper presents a new three-stage training process, a discrete latent space diffusion model, and a style control method using natural language prompts, advancing expressive TTS capabilities.
Findings
Effective style control with natural language prompts
Robust sentence embedding capturing style semantics
Generation of diverse expressive speech samples
Abstract
Expressive text-to-speech (TTS) aims to synthesize speech in different speaking styles according to human demands. There are currently two common ways to control speaking style: (1) Pre-defining a set of speaking styles and using a categorical index to denote each one. This limits the diversity of expressiveness, as such models can only generate the pre-defined styles. (2) Using reference speech as the style input, which has the problem that the extracted style information is neither intuitive nor interpretable. In this study, we attempt to use natural language as a style prompt to control the style of the synthetic speech, e.g., "Sigh tone in full of sad mood with some helpless feeling". Considering that no existing TTS corpus is suitable for benchmarking this novel task, we first construct a speech corpus, whose speech samples are annotated…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
