Conversational End-to-End TTS for Voice Agent

Haohan Guo; Shaofei Zhang; Frank K. Soong; Lei He; Lei Xie

arXiv:2005.10438·cs.SD·November 17, 2020·5 cites

TL;DR

This paper presents a conversational end-to-end TTS system for voice agents that uses conversational context to produce more natural and spontaneous speech, addressing the corpus and modeling limitations that have held back earlier reading-style systems.

Contribution

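It introduces a spontaneous conversational speech corpus built with a new recording scheme, and a conversation context-aware end-to-end TTS model that adds an auxiliary encoder and a conversational context encoder, improving naturalness and enabling spontaneous behaviors in synthesized speech.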

Findings

01. Produced more natural prosody aligned with conversational context

02. Achieved significant preference gains in listening tests

03. Expressed spontaneous behaviors such as fillers and repetitions

Abstract

End-to-end neural TTS has achieved superior performance on reading-style speech synthesis. However, building a high-quality conversational TTS remains a challenge due to the limitations of the corpus and modeling capability. This study aims to build a conversational TTS for a voice agent under the sequence-to-sequence modeling framework. First, we construct a spontaneous conversational speech corpus designed for the voice agent, using a new recording scheme that ensures both recording quality and conversational speaking style. Second, we propose a conversation context-aware end-to-end TTS approach that uses an auxiliary encoder and a conversational context encoder to reinforce information about both the current utterance and its context in the conversation. Experimental results show that the proposed methods produce more natural prosody in accordance with the conversational…
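The abstract describes conditioning synthesis on the current utterance together with its conversational context. As a rough illustration of the input side only, the sketch below pairs a turn with the turns that precede it. The `Turn` type, the function names, the speaker labels, and the window size `k` are all hypothetical; the listing does not say how the paper selects or encodes context, so this should not be read as the authors' method.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Turn:
    speaker: str  # hypothetical labels, e.g. "user" or "agent"
    text: str


def context_window(dialogue: Sequence[Turn], index: int, k: int = 2) -> List[Turn]:
    """Return up to k turns that immediately precede dialogue[index]."""
    if not 0 <= index < len(dialogue):
        raise IndexError("index is outside the dialogue")
    if k < 0:
        raise ValueError("k must be non-negative")
    start = max(0, index - k)  # clamp at the start of the conversation
    return list(dialogue[start:index])


def build_tts_input(dialogue: Sequence[Turn], index: int, k: int = 2) -> Dict[str, object]:
    """Pair the utterance to synthesize with its preceding context (assumed layout)."""
    return {
        "current_text": dialogue[index].text,
        "context_texts": [t.text for t in context_window(dialogue, index, k)],
    }


dialogue = [
    Turn("user", "Hi"),
    Turn("agent", "Hello, how can I help?"),
    Turn("user", "What's the weather?"),
    Turn("agent", "Um, let me check."),
]
example = build_tts_input(dialogue, index=3, k=2)
```

In a real system the two fields would feed separate encoders (for example, a text encoder for the current utterance and a context encoder for the history); here they are left as plain strings so the sketch stays self-contained.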

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques