UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts

Zhi-Qi Cheng; Xiang Li; Jun-Yan He; Junyao Chen; Xiaomao Fan; Xiaojiang Peng; Alexander G. Hauptmann

arXiv:2404.18398 · cs.CL · February 20, 2025

Open Access · 1 Repo

TL;DR

UMETTS introduces a multimodal framework for emotional TTS that aligns emotional cues from text, audio, and visual inputs to produce more expressive and emotionally accurate speech, surpassing existing methods.

Contribution

The paper presents UMETTS, a novel multimodal emotional TTS framework with contrastive learning for emotion alignment and improved speech synthesis quality.

Findings

01 Enhanced emotion accuracy in synthesized speech

02 Improved speech naturalness over traditional E-TTS

03 Effective multimodal emotional cue integration

Abstract

Emotional Text-to-Speech (E-TTS) synthesis has garnered significant attention in recent years due to its potential to revolutionize human-computer interaction. However, current E-TTS approaches often struggle to capture the intricacies of human emotions, primarily relying on oversimplified emotional labels or single-modality input. In this paper, we introduce the Unified Multimodal Prompt-Induced Emotional Text-to-Speech System (UMETTS), a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. The core of UMETTS consists of two key components: the Emotion Prompt Alignment Module (EP-Align) and the Emotion Embedding-Induced TTS Module (EMI-TTS). (1) EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.…
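
The abstract says EP-Align uses contrastive learning to align emotional features across modalities, but the excerpt does not give the objective. As an illustration only, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss between two modalities, the form popularized by CLIP. The function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example and are not taken from the paper or its repository.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Hypothetical two-modality alignment loss (assumed, not the paper's).

    emb_a[i] and emb_b[i] are embeddings of the same emotional prompt in
    two modalities (for example text and audio). Matching pairs are pulled
    together and all other pairs in the batch are pushed apart.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Cosine similarity of every a-row with every b-row, scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(u, v)) / temperature for v in b]
        for u in a
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch: two prompts, 2-dimensional embeddings (made-up values).
text_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_emb_aligned = [[0.9, 0.1], [0.1, 0.9]]
audio_emb_swapped = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(text_emb, audio_emb_aligned)
loss_swapped = symmetric_contrastive_loss(text_emb, audio_emb_swapped)
print(loss_aligned < loss_swapped)
```

With three modalities, one common choice is to sum this pairwise loss over text-audio, text-visual, and audio-visual pairs; whether UMETTS does so is not stated in this excerpt.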

Code & Models

Repositories

kttrcdl/umetts
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems

Methods: Contrastive Learning · ALIGN