Conversational End-to-End TTS for Voice Agent
Haohan Guo, Shaofei Zhang, Frank K. Soong, Lei He, Lei Xie

TL;DR
This paper presents a conversational end-to-end TTS system that uses conversational context to produce more natural and spontaneous speech for voice agents, addressing the corpus and modeling limitations that have held back conversational TTS.
Contribution
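It introduces a spontaneous conversational speech corpus recorded under a new scheme, together with a conversation context-aware end-to-end TTS model that adds an auxiliary encoder and a conversational context encoder, improving naturalness and spontaneous behaviors in synthesized speech.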
Findings
Produced more natural prosody aligned with conversational context
Achieved significant preference gains in listening tests
Model expressed spontaneous behaviors like fillers and repetitions
Abstract
End-to-end neural TTS has achieved superior performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge because of limitations in both the corpus and the modeling capability. This study aims to build a conversational TTS for a voice agent under a sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and a conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that uses an auxiliary encoder and a conversational context encoder to reinforce information about the current utterance and its context in the conversation. Experimental results show that the proposed methods produce more natural prosody in accordance with the conversational…
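The abstract does not spell out how the conversational context encoder feeds the synthesizer, so the following is only a minimal sketch of one common way to condition a sequence-to-sequence encoder on dialogue history. Everything here is an assumption made for illustration: the mean pooling over a fixed window of previous utterance embeddings, the concatenation onto each encoder frame, and the names `mean_pool`, `context_vector`, and `condition_encoder_outputs` are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def mean_pool(vectors: List[Vector], dim: int) -> Vector:
    """Average a list of vectors; return zeros when the list is empty."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def context_vector(history: List[Vector], window: int, dim: int) -> Vector:
    """Summarize the last `window` utterance embeddings.

    Assumption: simple mean pooling stands in for whatever learned
    context encoder the paper actually uses.
    """
    recent = history[-window:] if window > 0 else []
    return mean_pool(recent, dim)


def condition_encoder_outputs(
    encoder_outputs: List[Vector], context: Vector
) -> List[Vector]:
    """Append the same context vector to every encoder time step."""
    return [frame + context for frame in encoder_outputs]


# Toy data: three previous utterances, each a 2-dimensional embedding.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ctx = context_vector(history, window=2, dim=2)

# Toy text-encoder output for the current utterance: 2 frames of size 3.
encoder_outputs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
conditioned = condition_encoder_outputs(encoder_outputs, ctx)
```

With this toy input, each conditioned frame has five dimensions: three from the text encoder and two from the pooled context. A real system would replace the pooling with a trained encoder and let the attention-based decoder consume the conditioned frames.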
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
