CONCSS: Contrastive-based Context Comprehension for Dialogue-appropriate Prosody in Conversational Speech Synthesis

Yayue Deng; Jinlong Xue; Yukang Jia; Qifei Li; Yichen Han; Fengping Wang; Yingming Gao; Dengfeng Ke; Ya Li

arXiv:2312.10358·cs.CL·December 19, 2023·1 cites

TL;DR

CONCSS introduces a contrastive learning framework for conversational speech synthesis, significantly improving context understanding and prosody appropriateness in generated speech through self-supervised learning and negative sample augmentation.

Contribution

This paper presents what the authors describe as the first attempt to integrate contrastive learning into CSS, with the goal of strengthening context representation and discriminability so that synthesized speech carries more dialogue-appropriate prosody.

Findings

01 Enhanced prosody appropriateness in synthesized speech

02 Improved context understanding demonstrated in experiments

03 Effective self-supervised learning on unlabeled data

Abstract

Conversational speech synthesis (CSS) incorporates historical dialogue as supplementary information with the aim of generating speech that has dialogue-appropriate prosody. While previous methods have already delved into enhancing context comprehension, context representation still lacks effective representation capabilities and context-sensitive discriminability. In this paper, we introduce a contrastive learning-based CSS framework, CONCSS. Within this framework, we define an innovative pretext task specific to CSS that enables the model to perform self-supervised learning on unlabeled conversational datasets to boost the model's context understanding. Additionally, we introduce a sampling strategy for negative sample augmentation to enhance context vectors' discriminability. This is the first attempt to integrate contrastive learning into CSS. We conduct ablation studies on different…
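To make the idea of contrastive learning with negative samples concrete, here is a minimal sketch of a generic InfoNCE-style loss. It is an illustration only and not the objective used in CONCSS, whose pretext task and sampling strategy are not specified in the abstract. The function name `info_nce_loss`, the temperature value, the vector dimension, and the toy vectors are all assumptions made for this example.

```python
import numpy as np


def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for a single anchor vector (illustrative only).

    anchor:    (d,) vector, e.g. a context embedding (hypothetical)
    positive:  (d,) vector that should be close to the anchor
    negatives: (k, d) matrix of vectors that should be far from the anchor
    """
    def l2_normalize(x):
        # Scale vectors to unit length so dot products are cosine similarities.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a = l2_normalize(anchor)
    # Row 0 holds the positive; the remaining rows hold the negatives.
    candidates = l2_normalize(np.vstack([positive[None, :], negatives]))
    logits = candidates @ a / temperature
    logits = logits - logits.max()  # shift for numerical stability
    log_prob_positive = logits[0] - np.log(np.exp(logits).sum())
    return float(-log_prob_positive)


# Toy data (assumed): a context vector, a slightly perturbed copy of it,
# and eight unrelated random vectors standing in for negative samples.
rng = np.random.default_rng(0)
context = rng.normal(size=16)
matching = context + 0.05 * rng.normal(size=16)
mismatched = rng.normal(size=(8, 16))

# Loss when the true match is treated as the positive.
loss_easy = info_nce_loss(context, matching, mismatched)
# Loss when an unrelated vector is wrongly treated as the positive.
loss_hard = info_nce_loss(
    context, mismatched[0], np.vstack([matching[None, :], mismatched[1:]])
)
```

In this sketch the loss is low when the positive is the vector most similar to the anchor and high when a negative is more similar, which is the pressure that pushes embeddings toward being discriminative. Adding more negatives, or negatives that are harder to tell apart from the positive, tightens that pressure.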

Taxonomy

Topics: Speech and dialogue systems · Speech Recognition and Synthesis · Topic Modeling

Methods: Contrastive Learning