UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts
Zhi-Qi Cheng, Xiang Li, Jun-Yan He, Junyao Chen, Xiaomao Fan, Xiaojiang Peng, Alexander G. Hauptmann

TL;DR
UMETTS introduces a multimodal framework for emotional TTS that aligns emotional cues from text, audio, and visual inputs to produce more expressive and emotionally accurate speech, surpassing existing methods.
Contribution
The paper presents UMETTS, a novel multimodal emotional TTS framework with contrastive learning for emotion alignment and improved speech synthesis quality.
Findings
Enhanced emotion accuracy in synthesized speech
Improved speech naturalness over traditional E-TTS
Effective multimodal emotional cue integration
Abstract
Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.…
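The abstract states that EP-Align uses contrastive learning to align emotional features across modalities, but it does not give the loss. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE objective between two modalities; the function names, the temperature value, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (guards against a zero vector)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Assumed symmetric InfoNCE loss: row i of emb_a pairs with row i of emb_b."""
    a = [normalize(v) for v in emb_a]
    b = [normalize(v) for v in emb_b]
    # Cosine similarity between every a/b pair, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


# Toy data standing in for per-sample emotion embeddings (hypothetical).
random.seed(0)
dim, batch = 16, 8
text_emb = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]
audio_aligned = [[x + 0.01 * random.gauss(0, 1) for x in v] for v in text_emb]
audio_random = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(batch)]

loss_aligned = symmetric_contrastive_loss(text_emb, audio_aligned)
loss_random = symmetric_contrastive_loss(text_emb, audio_random)
print(loss_aligned, loss_random)
```

In this toy setup, the audio embeddings that closely track the text embeddings should give a lower loss than unrelated ones, which is the behavior a contrastive alignment objective is meant to reward. Extending the idea to three modalities (text, audio, visual) would presumably sum the loss over modality pairs, though the abstract does not say how the paper combines them.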
Taxonomy
Topics: Speech and dialogue systems
Methods: Contrastive Learning · ALIGN
