Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech

Yunji Chu, Yunseob Shim, and Unsang Park

arXiv:2409.16203·cs.SD·September 25, 2024

Open Access

TL;DR

FEIM-TTS is a zero-shot, facial expression-aware TTS model that synthesizes emotionally expressive speech aligned with facial cues and emotion intensity, enhancing accessibility and virtual character voice adaptability.

Contribution

It introduces a novel zero-shot TTS framework that integrates facial representations and emotion intensity without relying on labeled datasets.

Findings

01. Successfully synthesizes emotionally expressive speech aligned with facial cues.

02. Demonstrates adaptability across diverse datasets and speakers.

03. Enhances accessibility for visually impaired users.

Abstract

We propose FEIM-TTS, an innovative zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech, aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS transcends traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without dependence on labeled datasets. To address sparse audio-visual-emotional data, the model is trained using LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. FEIM-TTS's unique capability to produce high-quality, speaker-agnostic speech makes it suitable for creating adaptable voices for virtual characters. Moreover, FEIM-TTS significantly enhances accessibility for individuals with visual impairments or those who have trouble seeing. By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics,…

Taxonomy

Topics: Emotion and Mood Recognition · Speech and Audio Processing · Face Recognition and Analysis