Intra- and Inter-modal Context Interaction Modeling for Conversational Speech Synthesis

Zhenqi Jia; Rui Liu

arXiv:2412.18733·cs.CL·December 30, 2024

TL;DR

This paper introduces III-CSS, a novel conversational speech synthesis system that explicitly models intra- and inter-modal context interactions using contrastive learning, significantly improving prosody expressiveness in generated speech.

Contribution

The paper proposes a new conversational speech synthesis (CSS) system built on an interaction scheme that explicitly models intra- and inter-modal context interactions with contrastive learning modules, improving the prosody of the synthesized speech.

Findings

01. Outperforms advanced baselines in prosody expressiveness

02. Uses contrastive learning for intra- and inter-modal interaction modeling

03. Effective on the DailyTalk dataset

Abstract

Conversational Speech Synthesis (CSS) aims to use the multimodal dialogue history (MDH) effectively to generate speech with appropriate conversational prosody for the target utterance. The key challenge of CSS is to model the interaction between the MDH and the target utterance. The text and speech modalities in the MDH each have their own influence, and they complement each other to produce a comprehensive impact on the target utterance. Previous works did not explicitly model such intra-modal and inter-modal interactions. To address this issue, we propose a new CSS system based on an intra-modal and inter-modal context interaction scheme, termed III-CSS. Specifically, in the training phase, we combine the MDH with the text and speech modalities of the target utterance to obtain four modal combinations, including Historical Text-Next Text, Historical Speech-Next Speech, Historical…
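To make the idea of contrastive learning over one history/next pairing concrete, here is a minimal sketch in plain Python. It assumes an InfoNCE-style loss, which is a common contrastive objective; the abstract does not state which loss III-CSS actually uses. The function name `info_nce`, the temperature value, and the toy two-dimensional embeddings are all hypothetical and chosen only for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(history, targets, temperature=0.1):
    """Mean InfoNCE loss over a batch (assumed loss, not confirmed by the source).

    history[i] and targets[i] are treated as the positive pair;
    every other targets[j] in the batch acts as a negative.
    """
    h = [l2_normalize(x) for x in history]
    t = [l2_normalize(x) for x in targets]
    total = 0.0
    for i, h_i in enumerate(h):
        # Cosine similarities scaled by the temperature.
        logits = [dot(h_i, t_j) / temperature for t_j in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_denominator - logits[i]
    return total / len(h)


# Toy embeddings standing in for a "Historical Text" / "Next Text" pairing.
history_text = [[1.0, 0.0], [0.0, 1.0]]
next_text = [[0.9, 0.1], [0.2, 0.8]]

aligned = info_nce(history_text, next_text)
shuffled = info_nce(history_text, next_text[::-1])
print(aligned < shuffled)
```

With correctly paired embeddings the loss is close to zero, while reversing the targets breaks the pairing and the loss grows. In a setup like the one the abstract describes, a loss of this kind could be computed for each modal combination, though how the paper combines them is not stated here.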

Taxonomy

Topics: Speech and dialogue systems

Methods: ADaptive gradient method with the OPTimal convergence rate