Emotional Face-to-Speech
Jiaxin Ye, Boyuan Cao, Hongming Shan

TL;DR
This paper introduces DEmoFace, a generative framework that synthesizes emotional speech directly from expressive facial cues using a discrete diffusion transformer trained with curriculum learning.
Contribution
The paper defines a new task, emotional face-to-speech synthesis, and proposes DEmoFace, a generative framework that combines a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec.
Findings
DEmoFace produces more natural, consistent speech than baselines.
The framework surpasses speech-driven methods in quality.
Enhanced predictor-free guidance enables diverse, multi-conditional generation.
Abstract
How much can we infer about an emotional voice solely from an expressive face? This intriguing question holds great potential for applications such as virtual character dubbing and aiding individuals with expressive language disorders. Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression. In this paper, we explore a new task, termed emotional face-to-speech, aiming to synthesize emotional speech directly from expressive facial cues. To that end, we introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning, built upon a multi-level neural audio codec. Specifically, we propose multimodal DiT blocks to dynamically align text and speech while tailoring vocal styles based on facial emotion and identity. To enhance…
Taxonomy
Topics: Language, Discourse, Communication Strategies
Methods: Diffusion · ALIGN
