Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling
Rui Liu, Yifan Hu, Yi Ren, Xiang Yin, Haizhou Li

TL;DR
This paper introduces ECSS, a novel conversational speech synthesis model that leverages heterogeneous graph-based context modeling and contrastive learning to improve emotional expressiveness in dialogue, addressing data scarcity and emotion modeling challenges.
Contribution
The paper proposes a new ECSS model with heterogeneous graph-based context modeling and contrastive emotion rendering, enhancing emotional expressiveness in conversational speech synthesis.
Findings
Outperforms baseline models in emotion understanding and rendering
Effective handling of emotional data scarcity through detailed annotations
Demonstrates the importance of comprehensive emotional labels
Abstract
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of the CSS task, prior studies have not thoroughly investigated the emotional expressiveness problems due to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn the emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the…
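The abstract names contrastive learning as the basis of the emotion renderer but, as excerpted here, does not state the loss. As a rough illustration only, the sketch below shows a generic supervised contrastive loss over emotion-labelled embeddings: samples with the same label are pulled together and all others are pushed apart. The function name, the `temperature` default, and the toy embeddings are assumptions made for this example and are not taken from ECSS.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the ECSS loss).

    embeddings: list of non-zero vectors, one per utterance (hypothetical input).
    labels: one emotion label per vector.
    """
    def normalise(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    z = [normalise(v) for v in embeddings]
    n = len(z)
    per_anchor = []
    for i in range(n):
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        positives = [j for j in sims if labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without a same-label partner contributes nothing
        # Numerically stable log of the softmax denominator.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Average negative log-probability of picking a same-label sample.
        per_anchor.append(
            -sum(sims[j] - log_denom for j in positives) / len(positives)
        )
    if not per_anchor:
        raise ValueError("no anchor has a same-label positive")
    return sum(per_anchor) / len(per_anchor)


# Toy data: two tight clusters in 2-D (assumed values for illustration).
emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
matched = supervised_contrastive_loss(emb, ["happy", "happy", "sad", "sad"])
mismatched = supervised_contrastive_loss(emb, ["happy", "sad", "happy", "sad"])
```

When the labels agree with the geometry of the embeddings (`matched`), the loss is small; when they cut across the clusters (`mismatched`), it is much larger, which is the signal a contrastive objective of this kind would use to separate emotion categories.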
Taxonomy
Topics: Emotion and Mood Recognition · Sentiment Analysis and Opinion Mining · Speech Recognition and Synthesis
