Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

Rui Liu; Zhenqi Jia; Jie Yang; Yifan Hu; Haizhou Li

arXiv:2410.09524 · cs.CL · October 15, 2024

Open Access

TL;DR

This paper introduces ER-CTTS, a novel emphasis rendering approach for conversational TTS that models multi-modal, multi-scale context to improve emphasis expression, addressing data scarcity and enhancing speech naturalness.

Contribution

The paper proposes a new emphasis rendering scheme for CTTS that integrates textual and acoustic contexts at multiple scales, improving emphasis modeling in conversational speech synthesis.

Findings

01 Outperforms baseline models in emphasis rendering

02 Effectively models multi-modal and multi-scale context influences

03 Addresses data scarcity with annotated emphasis intensity

Abstract

Conversational Text-to-Speech (CTTS) aims to accurately express an utterance with the appropriate style within a conversational setting, which attracts more attention nowadays. While recognizing the significance of the CTTS task, prior studies have not thoroughly investigated speech emphasis expression, which is essential for conveying the underlying intention and attitude in human-machine interaction scenarios, due to the scarcity of conversational emphasis datasets and the difficulty in context understanding. In this paper, we propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS, that includes two main components: 1) we simultaneously take into account textual and acoustic contexts, with both global and local semantic modeling to understand the conversation context comprehensively; 2) we deeply integrate multi-modal and multi-scale context to learn the…

Taxonomy

Topics: Speech and dialogue systems · Natural Language Processing Techniques

Methods: Softmax · Attention Is All You Need