EME-TTS: Unlocking the Emphasis and Emotion Link in Speech Synthesis
Haoxun Li, Leyuan Qu, Jiaxi Hu, Taihao Li

TL;DR
EME-TTS is a speech synthesis framework that models emphasis and emotion together. It uses weakly supervised learning and an Emphasis Perception Enhancement (EPE) block to make emotional speech more expressive while keeping emphasis clear and stable across emotions.
Contribution
The paper presents EME-TTS, a framework that uses emphasis to make emotional speech synthesis more expressive and keeps the target emphasis perceptually clear and stable across different emotions.
Findings
Enables more natural emotional speech synthesis when combined with large language models for emphasis position prediction.
Maintains stable and distinguishable emphasis across emotions.
Uses weakly supervised learning with emphasis pseudo-labels.
Abstract
In recent years, emotional Text-to-Speech (TTS) synthesis and emphasis-controllable speech synthesis have advanced significantly. However, their interaction remains underexplored. We propose Emphasis Meets Emotion TTS (EME-TTS), a novel framework designed to address two key research questions: (1) how to effectively utilize emphasis to enhance the expressiveness of emotional speech, and (2) how to maintain the perceptual clarity and stability of target emphasis across different emotions. EME-TTS employs weakly supervised learning with emphasis pseudo-labels and variance-based emphasis features. Additionally, the proposed Emphasis Perception Enhancement (EPE) block enhances the interaction between emotional signals and emphasis positions. Experimental results show that EME-TTS, when combined with large language models for emphasis position prediction, enables more natural emotional…
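The abstract does not say how the emphasis pseudo-labels are produced. As a rough illustration of the general idea only, and not the authors' method, the sketch below labels a word as emphasized when its prosody stands out from the rest of the utterance. The function names, the choice of word-level pitch, energy, and duration as features, the equal weighting, and the threshold of 1.0 are all assumptions made for this example.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values within one utterance; all zeros if there is no spread."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def emphasis_pseudo_labels(pitch, energy, duration, threshold=1.0):
    """Return one 0/1 pseudo-label per word (hypothetical heuristic).

    Each argument holds one value per word. A word is labeled 1 when the
    average of its standardized pitch, energy, and duration exceeds
    `threshold` (an assumed cutoff, not a value from the paper).
    """
    if not (len(pitch) == len(energy) == len(duration)):
        raise ValueError("feature lists must have one entry per word")
    if not pitch:
        return []
    per_feature = [zscores(pitch), zscores(energy), zscores(duration)]
    # zip(*per_feature) regroups the z-scores word by word.
    scores = [mean(word_scores) for word_scores in zip(*per_feature)]
    return [1 if s > threshold else 0 for s in scores]


# Made-up word-level measurements for a four-word utterance.
words = ["I", "never", "said", "that"]
pitch = [180.0, 260.0, 175.0, 170.0]   # mean F0 in Hz
energy = [0.50, 0.90, 0.45, 0.40]      # relative energy
duration = [0.12, 0.35, 0.20, 0.22]    # seconds
labels = emphasis_pseudo_labels(pitch, energy, duration)
print(list(zip(words, labels)))
```

With these made-up numbers only "never" is marked as emphasized. Labels of this kind are noisy, which is why training on them counts as weak supervision.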
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Emotion and Mood Recognition
