Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models

Weiqin Li; Peiji Yang; Yicheng Zhong; Yixuan Zhou; Zhisheng Wang; Zhiyong Wu; Xixin Wu; Helen Meng

arXiv:2407.13509·cs.SD·July 19, 2024

Open Access

TL;DR

This paper introduces a novel language model-based spontaneous speech synthesis system that effectively captures diverse spontaneous behaviors and prosody variations, significantly improving naturalness in synthesized spontaneous speech.

Contribution

The paper systematically categorizes spontaneous behaviors and adds fine-grained prosody modeling, with the aim of making synthesized spontaneous speech more realistic.

Findings

01. Outperforms baseline in prosody naturalness
02. Enhances spontaneous behavior realism
03. Captures subtle prosody variations

Abstract

Spontaneous style speech synthesis, which aims to generate human-like speech, often encounters challenges due to the scarcity of high-quality data and limitations in model capabilities. Recent language model-based TTS systems can be trained on large, diverse, and low-quality speech datasets, resulting in highly natural synthesized speech. However, they are limited by the difficulty of simulating various spontaneous behaviors and capturing prosody variations in spontaneous speech. In this paper, we propose a novel spontaneous speech synthesis system based on language models. We systematically categorize and uniformly model diverse spontaneous behaviors. Moreover, fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech. Experimental results show that our proposed method significantly outperforms the baseline…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques · Topic Modeling