Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

Jingbei Li; Yi Meng; Chenyi Li; Zhiyong Wu; Helen Meng; Chao Weng; Dan Su

arXiv:2106.06233·cs.SD·April 1, 2022·1 cites

TL;DR

This paper introduces a graph-based multi-modal context modeling approach for conversational TTS that captures inter- and intra-speaker influences, leading to more natural speaking styles in synthesized speech.

Contribution

It proposes a novel DialogueGCN-based method for context modeling in conversational TTS, improving style consistency and naturalness over RNN-based approaches.

Findings

01 Outperforms state-of-the-art methods in MOS scores

02 Achieves higher ABX preference rates

03 Effectively models inter- and intra-speaker influences

Abstract

Compared with traditional text-to-speech (TTS) systems, conversational TTS systems are required to synthesize speech with a speaking style that conforms to the conversational context. However, state-of-the-art context modeling methods in conversational TTS model only the textual information in the context, using a recurrent neural network (RNN). Such methods have limited ability to model the inter-speaker influence in conversations, and they also neglect the speaking styles and the intra-speaker inertia within each speaker. Inspired by DialogueGCN and its superiority over RNN-based approaches in modeling such conversational influences, we propose a graph-based multi-modal context modeling method and apply it to conversational TTS to enhance the speaking styles of synthesized speech. Both the textual and speaking style information in the context are extracted and processed by DialogueGCN…
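As a rough illustration of the inter- and intra-speaker distinction mentioned in the abstract, the sketch below builds a small conversation graph in which each utterance is a node and each edge is labeled by whether the two utterances share a speaker. This is not the authors' implementation: the function name `build_dialogue_edges`, the `window` parameter, and the two relation labels are hypothetical choices made for illustration, and the paper's actual graph construction may differ (for example, in how temporal direction or multi-modal features are handled).

```python
from typing import List, Tuple


def build_dialogue_edges(
    speakers: List[str], window: int = 2
) -> List[Tuple[int, int, str]]:
    """Return (source, target, relation) edges between utterance nodes.

    Hypothetical helper: connects each utterance to every utterance
    (including itself) that lies within `window` turns, and labels the
    edge "intra" when both utterances come from the same speaker and
    "inter" otherwise.
    """
    edges = []
    for i, target_speaker in enumerate(speakers):
        start = max(0, i - window)
        stop = min(len(speakers), i + window + 1)
        for j in range(start, stop):
            relation = "intra" if speakers[j] == target_speaker else "inter"
            edges.append((j, i, relation))
    return edges


# Three-turn conversation between speakers A and B (illustrative data).
edges = build_dialogue_edges(["A", "B", "A"], window=1)
for source, target, relation in edges:
    print(f"utterance {source} -> utterance {target}: {relation}")
```

A relational graph network could then use a separate set of weights per relation label, so that influence from the same speaker and influence from the other speaker are modeled differently.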

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis