Contextual Expressive Text-to-Speech

Jianhong Tu; Zeyu Cui; Xiaohuan Zhou; Siqi Zheng; Kai Hu; Ju Fan; Chang Zhou

arXiv:2211.14548·eess.AS·November 29, 2022

Open Access

TL;DR

This paper introduces Contextual TTS, a new approach that synthesizes expressive speech based on textual context rather than predefined style labels, improving naturalness and expressiveness.

Contribution

It proposes the CTTS task setting, constructs a synthetic dataset, and develops a framework that leverages context for more natural and expressive speech synthesis.

Findings

1. Framework generates high-quality expressive speech from context

2. Effective on both synthetic and real-world data

3. Outperforms label-based style control methods

Abstract

The goal of expressive text-to-speech (TTS) is to synthesize natural, highly expressive speech with the desired content, prosody, emotion, or timbre. Most previous studies attempt to generate speech from given labels of styles and emotions, which oversimplifies the problem by classifying styles and emotions into a fixed number of predefined categories. In this paper, we introduce a new task setting, Contextual TTS (CTTS). The main idea of CTTS is that how a person speaks depends on the particular context she is in, where the context can typically be represented as text. Thus, in the CTTS task, we propose to utilize such context to guide the speech synthesis process instead of relying on explicit labels of styles and emotions. To achieve this task, we construct a synthetic dataset and develop an effective framework. Experiments show that our framework can generate high-quality…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Music and Audio Processing