Towards Spontaneous Style Modeling with Semi-supervised Pre-training for   Conversational Text-to-Speech Synthesis

Weiqin Li; Shun Lei; Qiaochu Huang; Yixuan Zhou; Zhiyong Wu; Shiyin Kang; Helen Meng

arXiv:2308.16593 · cs.SD · September 1, 2023

TL;DR

This paper introduces a semi-supervised pre-training approach for conversational Text-to-Speech synthesis that enhances spontaneous speech modeling by leveraging both labeled and unlabeled data, resulting in more human-like and expressive speech.

Contribution

It presents a novel semi-supervised learning framework with a linguistic-aware encoder to improve spontaneous speech synthesis and behavior prediction from limited labeled data.

Findings

1. Achieves superior expressive speech synthesis performance.
2. Effectively models spontaneous behavior in speech.
3. Predicts spontaneous behavior from text accurately.

Abstract

The spontaneous behavior that often occurs in conversations makes speech more human-like than reading-style speech. However, synthesizing spontaneous-style speech is challenging due to the lack of high-quality spontaneous datasets and the high cost of labeling spontaneous behavior. In this paper, we propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavior labels. During semi-supervised learning, both text and speech information are used to detect spontaneous behavior labels in speech. Moreover, a linguistic-aware encoder is used to model the relationship between the sentences in a conversation. Experimental results indicate that our proposed method achieves superior expressive speech synthesis performance with the ability to model spontaneous behavior in spontaneous-style speech and predict…

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and Dialogue Systems