Controllable Context-aware Conversational Speech Synthesis
Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su

TL;DR
This paper introduces a controllable speech synthesis framework that models spontaneous conversational behaviors and speech entrainment, enabling more human-like and natural dialogue generation.
Contribution
It presents a unified neural network-based system that explicitly models spontaneous behaviors and uses adversarial training to simulate speech entrainment in conversations.
Findings
The system can control the frequency of disfluencies in synthesized speech.
It effectively models speech entrainment at the acoustic level.
Synthesized conversations are more realistic and natural.
Abstract
In spoken conversations, spontaneous behaviors such as filled pauses and prolongations commonly occur. Conversational partners also tend to align features of their speech with those of their interlocutor, a phenomenon known as entrainment. To produce human-like conversations, we propose a unified controllable spontaneous conversational speech synthesis framework that models both phenomena. Specifically, we use explicit labels to represent two typical spontaneous behaviors, filled pause and prolongation, in the acoustic model, and we develop a neural-network-based predictor to predict the occurrences of the two behaviors from text. We then develop an algorithm based on the predictor to control how often the behaviors occur, so that the synthesized speech can range from less disfluent to more disfluent. To model speech entrainment at the acoustic level, we utilize a context acoustic encoder to…
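The abstract mentions an algorithm that uses the behavior predictor to control how often filled pauses and prolongations occur, but this excerpt does not describe it. As an illustration only, one simple way to get that kind of control is to threshold the predictor's per-token probabilities. The sketch below assumes a hypothetical predictor that returns, for each token, probabilities for "none", "filled_pause", and "prolongation"; the function name `select_behaviors`, the label names, and the `threshold` parameter are all assumptions and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# Hypothetical per-token predictor output: (p_none, p_filled_pause, p_prolongation).
TokenProbs = Tuple[float, float, float]


def select_behaviors(probs: Sequence[TokenProbs], threshold: float) -> List[str]:
    """Assign a spontaneous-behavior label to each token.

    A behavior label is emitted only when its probability reaches
    `threshold`; lowering the threshold yields more disfluent output,
    raising it yields less disfluent output. This is an illustrative
    thresholding scheme, not the paper's algorithm.
    """
    labels: List[str] = []
    for _p_none, p_fp, p_pl in probs:
        # Pick the more likely of the two behaviors for this token.
        if p_fp >= p_pl:
            best_label, best_p = "filled_pause", p_fp
        else:
            best_label, best_p = "prolongation", p_pl
        labels.append(best_label if best_p >= threshold else "none")
    return labels


# Made-up probabilities for a three-token utterance.
example_probs = [(0.9, 0.05, 0.05), (0.4, 0.5, 0.1), (0.5, 0.2, 0.3)]
less_disfluent = select_behaviors(example_probs, threshold=0.45)
more_disfluent = select_behaviors(example_probs, threshold=0.25)
```

With these made-up numbers, the higher threshold labels only the second token, while the lower threshold also labels the third, which is the "less disfluent to more disfluent" range the abstract describes.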
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
