Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis
Zhenqi Jia, Rui Liu

TL;DR
This paper introduces III-CSS, a novel conversational speech synthesis system that explicitly models intra- and inter-modal context interactions using contrastive learning, significantly improving prosody expressiveness in generated speech.
Contribution
The paper proposes a new interaction scheme-based CSS system that explicitly models intra- and inter-modal context interactions with contrastive learning modules, enhancing speech prosody.
Findings
Outperforms advanced baselines in prosody expressiveness
Uses contrastive learning for intra- and inter-modal interaction modeling
Effective on the DailyTalk dataset
Abstract
Conversational Speech Synthesis (CSS) aims to effectively use the multimodal dialogue history (MDH) to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. Note that the text and speech modalities in the MDH each have their own unique influence, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new intra-modal and inter-modal context interaction scheme-based CSS system, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities in the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical…
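To make the idea of contrastive interaction modeling concrete, here is a minimal sketch of how a history embedding could be pulled toward the embedding of its own next utterance and pushed away from the others in a batch. It assumes an InfoNCE-style loss over cosine similarities, which is a common choice for contrastive learning but is not confirmed by this excerpt as the loss III-CSS uses. The function names (`cosine`, `info_nce`), the temperature value, and the toy two-dimensional embeddings are all hypothetical. Only the two modal combinations named in the truncated abstract are shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(history, target, temperature=0.1):
    """Mean InfoNCE loss over a batch (an assumed loss, not the paper's stated one).

    history[i] is treated as the positive match for target[i];
    every target[j] with j != i acts as a negative.
    """
    losses = []
    for i, h in enumerate(history):
        logits = [cosine(h, t) / temperature for t in target]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): one row per dialogue in the batch.
hist_text = [[1.0, 0.0], [0.0, 1.0]]
next_text = [[0.9, 0.1], [0.1, 0.9]]
hist_speech = [[0.8, 0.2], [0.3, 0.7]]
next_speech = [[0.7, 0.3], [0.2, 0.8]]

# Only the combinations named in the excerpt; the remaining ones are elided there.
pairs = {
    "Historical Text-Next Text": (hist_text, next_text),
    "Historical Speech-Next Speech": (hist_speech, next_speech),
}

per_pair_loss = {name: info_nce(h, t) for name, (h, t) in pairs.items()}
total_loss = sum(per_pair_loss.values())
```

In this sketch the per-combination losses are simply summed. How the actual system weights or combines its contrastive terms is not stated in the excerpt.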
Taxonomy
Topics: Speech and dialogue systems
Methods: ADaptive gradient method with the OPTimal convergence rate
