Facial Expression-Enhanced TTS: Combining Face Representation and Emotion Intensity for Adaptive Speech
Yunji Chu, Yunseob Shim, and Unsang Park

TL;DR
FEIM-TTS is a zero-shot, facial expression-aware TTS model that synthesizes emotionally expressive speech aligned with facial cues and emotion intensity, enhancing accessibility and virtual character voice adaptability.
Contribution
The paper introduces a zero-shot TTS framework that integrates facial representations and emotion intensity without relying on labeled datasets.
Findings
Successfully synthesizes emotionally expressive speech aligned with facial cues.
Demonstrates adaptability across diverse datasets and speakers.
Enhances accessibility for visually impaired users.
Abstract
We propose FEIM-TTS, an innovative zero-shot text-to-speech (TTS) model that synthesizes emotionally expressive speech, aligned with facial images and modulated by emotion intensity. Leveraging deep learning, FEIM-TTS transcends traditional TTS systems by interpreting facial cues and adjusting to emotional nuances without dependence on labeled datasets. To address sparse audio-visual-emotional data, the model is trained on the LRS3, CREMA-D, and MELD datasets, demonstrating its adaptability. FEIM-TTS's unique capability to produce high-quality, speaker-agnostic speech makes it suitable for creating adaptable voices for virtual characters. Moreover, FEIM-TTS significantly enhances accessibility for individuals with visual impairments. By integrating emotional nuances into TTS, our model enables dynamic and engaging auditory experiences for webcomics,…
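The abstract does not say how emotion intensity enters the model, so the following is only a minimal sketch of one common way such conditioning could be built, not the authors' architecture. It assumes a face embedding, a neutral embedding, and an emotion embedding are already available as plain vectors, and that intensity is a scalar in [0, 1]. The names `blend_emotion` and `build_condition` are hypothetical.

```python
from typing import List

Vector = List[float]


def blend_emotion(neutral: Vector, emotion: Vector, intensity: float) -> Vector:
    """Linearly interpolate from a neutral embedding toward an emotion embedding.

    intensity = 0.0 returns the neutral embedding, 1.0 the full emotion embedding.
    This interpolation is an illustrative assumption, not taken from the paper.
    """
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    if len(neutral) != len(emotion):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - intensity) * n + intensity * e for n, e in zip(neutral, emotion)]


def build_condition(face: Vector, neutral: Vector, emotion: Vector,
                    intensity: float) -> Vector:
    """Concatenate the face embedding with the intensity-scaled emotion embedding.

    The result stands in for the conditioning vector a TTS decoder might consume.
    """
    return list(face) + blend_emotion(neutral, emotion, intensity)
```

In this sketch, lowering `intensity` moves the conditioning vector toward the neutral embedding while leaving the face part unchanged, which is one simple way to keep speaker identity and emotional strength separately controllable.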
Taxonomy
Topics: Emotion and Mood Recognition · Speech and Audio Processing · Face recognition and analysis
