Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
Rui Liu, Zhenqi Jia, Jie Yang, Yifan Hu, Haizhou Li

TL;DR
This paper introduces ER-CTTS, a novel emphasis rendering approach for conversational TTS that models multi-modal, multi-scale context to improve emphasis expression, addressing data scarcity and enhancing speech naturalness.
Contribution
The paper proposes a new emphasis rendering scheme for CTTS that integrates textual and acoustic contexts at multiple scales, improving emphasis modeling in conversational speech synthesis.
Findings
Outperforms baseline models in emphasis rendering
Effectively models multi-modal and multi-scale context influences
Addresses data scarcity with annotated emphasis intensity
Abstract
Conversational Text-to-Speech (CTTS) aims to express an utterance accurately, with a style appropriate to its conversational setting, and has been attracting growing attention. Although prior studies recognize the significance of the CTTS task, they have not thoroughly investigated speech emphasis expression, which is essential for conveying underlying intention and attitude in human-machine interaction. Two obstacles explain this gap: the scarcity of conversational emphasis datasets and the difficulty of context understanding. In this paper, we propose a novel Emphasis Rendering scheme for the CTTS model, termed ER-CTTS, that includes two main components: 1) we simultaneously take into account textual and acoustic contexts, with both global and local semantic modeling to understand the conversation context comprehensively; 2) we deeply integrate multi-modal and multi-scale context to learn the…
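The idea of combining textual and acoustic context at global and local scales can be illustrated with a toy sketch. This is not the ER-CTTS architecture, which the abstract does not specify in detail. It assumes only that each (modality, scale) pair supplies a list of context vectors, that the current utterance is summarized by one query vector, and that scaled dot-product attention followed by concatenation is an acceptable stand-in for the fusion step. The names `attend`, `fuse_context`, and the context keys are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys):
    """Scaled dot-product attention: attention-weighted average of `keys`."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * key[i] for w, key in zip(weights, keys)) for i in range(d)]


def fuse_context(query, contexts):
    """Concatenate one attended summary per (modality, scale) context.

    Hypothetical fusion rule chosen for illustration only.
    """
    fused = []
    for name in sorted(contexts):  # fixed order keeps the output layout stable
        fused.extend(attend(query, contexts[name]))
    return fused


# Toy 2-dimensional embeddings (made-up values, not real features).
query = [1.0, 0.0]  # stands in for the current utterance
contexts = {
    "text_global": [[1.0, 0.0], [0.0, 1.0]],               # utterance-level text history
    "text_local": [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],    # token-level text history
    "audio_global": [[0.2, 0.8], [0.9, 0.1]],              # utterance-level audio history
    "audio_local": [[0.3, 0.3]],                           # frame-level audio history
}

fused = fuse_context(query, contexts)
print(len(fused))  # 4 contexts x 2 dimensions = 8
```

In a real system the fused vector would condition an emphasis predictor inside the TTS model; here it only shows how four context streams can be reduced to a single representation.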
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques
Methods: Softmax · Attention Is All You Need
